Preface


Bioinformatics sits at the intersection of four major scientific disciplines: biology,
mathematics, statistics, and computer science. That’s a very busy intersection, and many volumes
would be required to provide a comprehensive review of the state-of-the-art methodologies
used in bioinformatics today. That is not what this concise two-volume work of contributed
chapters attempts to do; rather, it provides a broad sampling of some of the most useful and
interesting current methods in this rapidly developing and expanding field.

As with other volumes in Methods in Molecular Biology, the focus is on providing 
practical guidance for implementing methods, using the kinds of tricks and tips that are 
rarely documented in textbooks or journal articles, but are nevertheless widely known and 
used by practitioners, and important for getting the most out of a method. The sharing of 
such expertise within the community of bioinformatics users and developers is an important 
part of the growth and maturation of the subject. These volumes are therefore aimed 
principally at graduate students, early career researchers, and others who are in the process 
of integrating new bioinformatics methods into their research. 

Much has happened in bioinformatics since the first edition of this work appeared in
2008, yet much of the methodology and practical advice contained in that edition remains
useful and current. This second edition therefore aims to complement, rather than
supersede, the first. Some of the chapters are revised and expanded versions of chapters from the
first edition, but most are entirely new, and all are intended to focus on more recent
developments.

Volume 1 comprises three parts: Data and Databases; Sequence Analysis; and
Phylogenetics and Evolution. The first part looks at bioinformatics methodologies of crucial
importance in the generation of sequence and structural data, and their organization into
conceptual categories and databases to facilitate further analyses. The Sequence Analysis part
describes some of the fundamental methodologies for processing the sequences of biological
molecules: techniques that are used in almost every pipeline of bioinformatics analysis,
particularly in the preliminary stages of such pipelines. Phylogenetics and Evolution deals
with methodologies that compare biological sequences for the purpose of understanding
how they evolved. This is a fundamental and interesting endeavor in its own right but is also
a crucial step towards understanding the functions of biological molecules and the nature of
their interactions, since those functions and interactions are essentially products of their
history.

Volume 2 also comprises three parts: Structure, Function, Pathways and Networks;
Applications; and Computational Methods. The first of these parts looks at methodologies
for understanding biological molecules as systems of interacting elements. This is a core task
of bioinformatics and is the aspect of the field that attempts to bridge the vast gap between
genotype and phenotype. The Applications part can only hope to cover a small number
of the numerous applications of bioinformatics. It includes chapters on the analysis of
genome-wide association data, computational diagnostics, and drug discovery. The final
part describes four broadly applicable computational methods, the scope of which far
exceeds that of bioinformatics, but which have nevertheless been crucial to this field.
These are modeling and inference, clustering, parameterized algorithmics, and visualization.


Melbourne, VIC, Australia


Jonathan M. Keith

Chapter 1


3D Computational Modeling of Proteins Using Sparse 
Paramagnetic NMR Data 

Kala Bharath Pilla, Gottfried Otting, and Thomas Huber 

Abstract 

Computational modeling of proteins using evolutionary or de novo approaches offers rapid structural
characterization, but often suffers from low success rates in generating high quality models comparable to
the accuracy of structures observed in X-ray crystallography or nuclear magnetic resonance (NMR)
spectroscopy. A computational/experimental hybrid approach incorporating sparse experimental restraints
in computational modeling algorithms drastically improves reliability and accuracy of 3D models. This
chapter discusses the use of structural information obtained from various paramagnetic NMR
measurements and demonstrates computational algorithms implementing pseudocontact shifts as restraints to
determine the structure of proteins at atomic resolution.

Key words Pseudocontact shifts, PCS, Paramagnetic NMR, Rosetta, GPS-Rosetta, Sparse restraints, 
3D structure determination 


1 Introduction 


Nuclear magnetic resonance (NMR) spectroscopy has for decades
facilitated structure determination in solution or in the solid state. NMR
exploits the properties of nuclear spins in strong constant magnetic
fields. The nuclear spins are manipulated by radiofrequency pulses
and their free induction decay is recorded. The recorded signal is then Fourier
transformed to produce the frequency spectrum of the NMR experiment.
Two spins that are close in space have a direct magnetic
interaction between them, referred to as dipole-dipole coupling.
When these two spins are aligned, the interaction energy becomes
minimal, resulting in the nuclear Overhauser effect (NOE). Intermolecular
and intramolecular NOEs are observed for spins that are
typically separated by 3-6 Å. By resolving a dense network of NOEs
[1], the 3D structures of proteins and nucleic acids can be determined.
This conventional method has been used to determine the structures
of a large number of proteins; however, assigning the
resonances of all spins in the system typically requires various 3D or
4D NMR experiments. In addition, proteins tend to produce poorer
spectra with increasing molecular weight, and determining 3D
structures becomes increasingly difficult.

As an alternative to short range restraints from NOEs, paramagnetic
NMR generates versatile structural restraints. Proteins
carrying paramagnetic metal ions induce significant effects in
NMR experiments. These effects arise from the unpaired electrons
of the paramagnetic metals, as electrons have a magnetic moment
that is three orders of magnitude larger than that of a proton.
Metalloproteins, which account for up to 25 % of the proteins in any
organism’s proteome [2], offer natural metal centers that potentially can
be directly exploited in paramagnetic NMR experiments. Further,
Mn²⁺, Fe²⁺, Cu²⁺, and Co²⁺ are naturally paramagnetic and
found in native biological samples.

Lanthanide ions are highly useful for paramagnetic NMR
experiments, as their paramagnetism varies greatly while their
physicochemical properties are highly similar. This makes it possible for
different lanthanides to be used interchangeably in different NMR
experiments [3]. Proteins that lack a natural metal center can be
engineered to carry lanthanides. Figure 1 illustrates different ways
to introduce metal ions into proteins. Small peptides, containing
12-18 residues, are designed to bind a lanthanide ion with their side
chain atoms, and these peptides are attached either to a thiol-reactive
cysteine or at the N- or C-terminus of a protein [4]. The
most popular means of attaching lanthanide ions is through metal
chelating chemical tags. These chemical tags are attached site specifically,
either through cysteine ligation or, more recently, using
unnatural amino acids which can be reacted via bio-orthogonal
click chemistry [5]. Several reviews [6-9] provide a comprehensive
overview of the chemistries used to functionalise proteins with
lanthanide tags.



Fig. 1 Illustration of various modes to introduce metal ions into proteins. (a) Replacing a native metal with a
paramagnetic lanthanide ion in metalloproteins. (b) Lanthanide binding peptides attached at the C-terminus of a
protein. (c) Lanthanide carrying chemical tag site specifically attached to a cysteine

1.1 Paramagnetic Effects in NMR

The unpaired electrons in a paramagnetic metal ion interact strongly with nuclear spins, and
the NMR spectrum changes due to the induced paramagnetic effects. These paramagnetic
effects are quantified by comparison with a diamagnetic (reference) spectrum and then
translated into structural restraints. The resulting structural restraints can be distance
dependent, orientation dependent, or both. Four distinct paramagnetic observables can be
measured in NMR experiments: the pseudocontact shift (PCS), the residual dipolar coupling
(RDC), the paramagnetic relaxation enhancement (PRE), and cross correlated relaxation (CCR).

1.1.1 Pseudocontact Shift (PCS)

PCS is a contribution to the chemical shift experienced by a spin caused by the presence of
centers of unpaired electrons. The PCS of a nucleus influenced by a paramagnetic center can be
calculated from the Δχ-tensor, shown in Fig. 2a, given by:

$$
\mathrm{PCS}^{\mathrm{calc}} = \frac{1}{12\pi r_{MH}^{3}}\left[\Delta\chi_{ax}\left(3\cos^{2}\theta_{MH} - 1\right) + \frac{3}{2}\,\Delta\chi_{rh}\,\sin^{2}\theta_{MH}\cos 2\varphi_{MH}\right] \tag{1}
$$

where r, θ, and φ define the polar coordinates of the nuclear spin with respect to the
principal axes of the Δχ-tensor (centered on the paramagnetic ion), and Δχ_ax and Δχ_rh are the
axial and rhombic components of the magnetic susceptibility tensor χ. The Δχ-tensor is defined
as the χ-tensor minus its isotropic component [10]. PCS is measured as the difference in the
chemical shift of a spin between the paramagnetic and diamagnetic states, as illustrated in
Fig. 2e.

Fig. 2 The four distinct paramagnetic effects represented geometrically. (a) The pseudocontact shift (PCS)
between the metal center (M) and an amide hydrogen (H). (b) The residual dipolar coupling (RDC) between two spins
H and N. (c) The paramagnetic relaxation enhancement (PRE) between M and H. (d) The cross correlation
between Curie spin and dipole-dipole relaxation (CCR) between M and H. (e) Measurement of the four different
paramagnetic effects, illustrated with two 1D undecoupled spectra, showing the diamagnetic and
paramagnetic antiphase doublets. PCS is measured as the change in chemical shift between paramagnetic and
diamagnetic states. RDC is measured as the difference in line splitting. PRE and CCR can be determined from
the differential line broadening. Adapted from Schmitz (2009) [49]

1.1.2 Residual Dipolar Coupling (RDC)

The presence of a paramagnetic metal weakly aligns the protein with the external magnetic
field, resulting in observable RDCs, which are manifested as an increase or decrease in the
multiplet splittings observed in undecoupled spectra, as illustrated in Fig. 2e. The RDC is
given by Eq. (2) and shown in Fig. 2b:

$$
D_{NH} = -\frac{B_{0}^{2}\,\gamma_{H}\gamma_{N}\hbar}{15kT\,8\pi^{2} r_{NH}^{3}}\left[\Delta\chi_{ax}\left(3\cos^{2}\theta_{NH} - 1\right) + \frac{3}{2}\,\Delta\chi_{rh}\,\sin^{2}\theta_{NH}\cos 2\varphi_{NH}\right] \tag{2}
$$

where B_0 is the magnetic field strength, γ_H and γ_N are the gyromagnetic ratios of the proton
and nitrogen spins, ħ = h/2π with h being Planck’s constant, kT is the thermal energy
(Boltzmann constant times temperature), and r_NH is the distance between the nitrogen and
proton nuclei [11].

1.1.3 Paramagnetic Relaxation Enhancement (PRE)

PREs give distance restraints between the paramagnetic lanthanide and the spin of interest from
peak intensity ratios between the paramagnetic and diamagnetic states (Fig. 2e). The PRE is
given by Eq. (3) and shown in Fig. 2c:

$$
R_{2}^{\mathrm{PRE}} = \frac{k}{r^{6}}\left(4\tau_{r} + \frac{3\tau_{r}}{1 + \omega_{H}^{2}\tau_{r}^{2}}\right) \tag{3}
$$

with

$$
k = \frac{1}{5}\left(\frac{\mu_{0}}{4\pi}\right)^{2}\frac{\omega_{H}^{2}\, g^{4}\mu_{B}^{4}\, J^{2}(J+1)^{2}}{\left(3k_{B}T\right)^{2}} \tag{4}
$$

where τ_r is the rotational correlation time, ω_H is the Larmor frequency of the proton, μ_0 is
the vacuum permeability, g the g-factor, μ_B the Bohr magneton, and J the total spin moment [11].

1.1.4 Cross Correlated Relaxation (CCR)

This effect is measured by comparing the line widths of the two components of the antiphase
doublet (Fig. 2e) [11]. It combines distance and angle dependence, as given by Eq. (5) and shown
in Fig. 2d:

$$
\eta^{\mathrm{CCR}} = \frac{k}{r^{3}}\left(3\cos^{2}\theta - 1\right)\left(4\tau_{r} + \frac{3\tau_{r}}{1 + \omega_{H}^{2}\tau_{r}^{2}}\right) \tag{5}
$$

with the prefactor k defined by Eq. (6).

[Eq. (6), the definition of the prefactor k for the CCR, is not recoverable from the source text.]

1.2 Structural Information from Paramagnetic Effects

RDCs, which are defined by the molecular alignment tensor (Eq. 2, Fig. 2b), give the
orientation of spin pairs relative to the external magnetic field in a distance independent
fashion. RDCs by themselves can be used directly to determine the structure of small proteins
only when a large number of experimental RDCs are available. Measurement of heteronuclear RDCs
becomes difficult for proteins that exhibit limited solubility or produce broad NMR line
widths due to tag mobility.

PREs, on the other hand, give distance information from the paramagnetic center (Eq. 3,
Fig. 2c). PREs induced by lanthanide ions range up to 20 Å, but the effect is heavily
influenced by the motion of the metal carrying tag [12]. Direct usage of PREs in structure
determination is limited, but chemically inert paramagnetic probes added as co-solvents can be
used quantitatively to characterize interfaces in protein-protein complexes.

1.2.1 Uniqueness of PCS

In comparison to RDCs and PREs, PCSs are the most potent
structural restraints. A PCS defined by the Δχ-tensor is both orientation
and distance dependent (Eq. 1, Fig. 2a). The PCS effect has
the longest range among all the paramagnetic effects, extending
up to 80 Å (40 Å from the paramagnetic center), and can be
precisely measured even at low protein concentrations (<20 μM)
[13]. It can easily be seen from Eq. (1) that the PCS experienced by a
spin is proportional to r⁻³, where r is the distance from the metal
center, and therefore decays more slowly with distance than the PRE with
its r⁻⁶ dependence (Eq. 3). RDCs, in stark contrast to PCSs and PREs,
are only orientation dependent (Eq. 2), brought about by the weak
alignment from the inserted paramagnetic metal [8].
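
The distance dependence described above can be checked numerically. The following is a minimal sketch that evaluates Eq. (1) in the principal axis frame of the tensor. The function name `pcs_ppm`, the choice of units (Å for distances, 10⁻³² m³ for tensor components), and the example tensor values are assumptions made for illustration and are not taken from the chapter.

```python
import math


def pcs_ppm(r_angstrom, theta, phi, dchi_ax, dchi_rh):
    """Evaluate Eq. (1) for one spin and return the PCS in ppm.

    r_angstrom: metal-spin distance in Angstrom.
    theta, phi: polar angles (radians) in the principal frame of the tensor.
    dchi_ax, dchi_rh: tensor components in units of 1e-32 m^3 (assumed units).
    """
    r_m = r_angstrom * 1e-10  # convert Angstrom to metre
    angular = (dchi_ax * (3.0 * math.cos(theta) ** 2 - 1.0)
               + 1.5 * dchi_rh * math.sin(theta) ** 2 * math.cos(2.0 * phi))
    # The ratio is dimensionless; multiply by 1e6 to express it in ppm.
    return 1e6 * (angular * 1e-32) / (12.0 * math.pi * r_m ** 3)


# Illustrative (hypothetical) tensor: axial 30, rhombic 10, in 1e-32 m^3.
near = pcs_ppm(10.0, 0.3, 0.5, 30.0, 10.0)
far = pcs_ppm(20.0, 0.3, 0.5, 30.0, 10.0)
print(f"PCS at 10 A: {near:.3f} ppm, at 20 A: {far:.3f} ppm")
print(f"PCS falls by a factor of {near / far:.1f} when r doubles (r^-3)")
print(f"An r^-6 effect such as the PRE falls by a factor of {2 ** 6}")
```

Doubling the distance reduces the PCS eightfold, whereas an effect with r⁻⁶ dependence is reduced 64-fold, which is why PCSs remain measurable much further from the metal.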

Experimentally, PCSs are easy to measure in proteins by taking
the difference in chemical shifts between a protein’s paramagnetic and
diamagnetic states in simple 2D NMR spectra (shown in
Fig. 3a). PCSs can also be measured with higher accuracy and
sensitivity than other paramagnetic effects, which require
measuring coupling constants between nuclei (RDCs) or
peak intensities (PREs). The induced PCS described by the
Δχ-tensor can be visualized as isosurfaces of constant PCS (shown
in Fig. 3b). The Δχ-tensor is fully defined by eight parameters: the
origin of the tensor frame, which coincides with the coordinates of
the metal (x, y, z); the orientation of the Δχ-tensor frame (three Euler angles
α, β, γ) with respect to the coordinate frame of the protein; and two
components of the Δχ-tensor, Δχ_ax (axial) and Δχ_rh (rhombic). To
solve for the full mathematical description of a Δχ-tensor, one needs
to measure a minimum of eight PCSs.

Fig. 3 Measurement of pseudocontact shifts (PCSs) and display of PCS as isosurfaces. (a) An illustration of three
superimposed ¹⁵N-HSQC spectra, showing the chemical shift changes due to the presence of paramagnetic metal
ions in the protein. Black resonances come from the diamagnetic reference (Y³⁺) sample, while red (Dy³⁺) and
magenta (Er³⁺) resonances show chemical shift changes due to the paramagnetic lanthanide ions attached in
the sample. (b) Visualization of induced PCS as isosurfaces calculated from the Δχ-tensor
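
To make the eight-parameter description concrete, the sketch below computes the PCS of a spin from a metal position, three Euler angles, and the axial and rhombic components. The Z-Y-Z rotation convention, the units, and all function names are assumptions chosen for illustration; the chapter does not specify a convention, and software packages differ in theirs.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_zyz(alpha, beta, gamma):
    """Rotation matrix Rz(alpha) * Ry(beta) * Rz(gamma) (assumed convention)."""
    def rz(t):
        return [[math.cos(t), -math.sin(t), 0.0],
                [math.sin(t), math.cos(t), 0.0],
                [0.0, 0.0, 1.0]]

    def ry(t):
        return [[math.cos(t), 0.0, math.sin(t)],
                [0.0, 1.0, 0.0],
                [-math.sin(t), 0.0, math.cos(t)]]

    return matmul3(matmul3(rz(alpha), ry(beta)), rz(gamma))


def pcs_from_eight_parameters(spin_xyz, metal_xyz, euler, dchi_ax, dchi_rh):
    """PCS (ppm) of one spin from the eight tensor parameters.

    Coordinates are in Angstrom in the protein frame; tensor components are
    in 1e-32 m^3 (assumed units).
    """
    # Traceless tensor in its principal frame, built from the axial and
    # rhombic components.
    principal = [[-dchi_ax / 3.0 + dchi_rh / 2.0, 0.0, 0.0],
                 [0.0, -dchi_ax / 3.0 - dchi_rh / 2.0, 0.0],
                 [0.0, 0.0, 2.0 * dchi_ax / 3.0]]
    rot = rotation_zyz(*euler)
    rot_t = [list(row) for row in zip(*rot)]
    tensor = matmul3(matmul3(rot, principal), rot_t)  # tensor in protein frame
    v = [(s - m) * 1e-10 for s, m in zip(spin_xyz, metal_xyz)]  # metres
    r = math.sqrt(sum(c * c for c in v))
    quad = sum(v[i] * tensor[i][j] * v[j] for i in range(3) for j in range(3))
    # For a traceless tensor, Eq. (1) is equivalent to v.T.v / (4 pi r^5).
    return 1e6 * 1e-32 * quad / (4.0 * math.pi * r ** 5)


# Hypothetical example: metal at the origin, spin 10 A away along z.
print(pcs_from_eight_parameters((0.0, 0.0, 10.0), (0.0, 0.0, 0.0),
                                (0.0, 0.0, 0.0), 30.0, 10.0))
```

Because each measured PCS supplies one equation in these eight unknowns, at least eight PCSs are needed before the parameters can be determined, as stated above.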


2 PCSs in Protein Structure Characterization 

2.1 Paramagnetic NMR Spectrum Assignment

Accurate assignment of resonances in the NMR spectrum is the essential first step in
extracting restraints. Especially for large proteins (>20 kDa), assignment of multidimensional
NMR spectra becomes increasingly difficult due to spectral overlap and increased transverse
relaxation of spins. If the 3D atomic coordinates of the nuclear spins are known, the
resonances of both paramagnetic and diamagnetic spectra can be assigned with software
algorithms. Several programs are available to assist with NMR assignments, including Numbat
[14], Possum [15], Echidna [16], and PARAssign [17].

2.2 Protein-Ligand Interactions

PCSs can be measured not only on the protein’s nuclear spins but also on the spins of bound
ligands. With the availability of a diverse range of metal binding chemical tags, the
orientation and location of the ligand can be readily identified with the help of PCSs
[3, 9]. This ability has major implications for rational drug design. John et al. [18]
demonstrated this concept using the E. coli ε186/θ complex (a natural lanthanide binding
protein) in complex with the ligand thymidine, where the ligand affinity and binding
orientation were determined using only PCSs. Saio et al. [19] showed that a combination of
PCSs and PREs generated from a two-point anchored lanthanide binding peptide can be used to
screen for ligands of the protein Grb2. Guan et al. [20] showed that even in the absence of
isotope labeled protein samples, the location of the ligand bound to the protein can be
determined at low resolution with predicted Δχ-tensor parameters.

2.3 Protein-Protein Complexes

Protein-protein complexes are fundamental to cellular signaling and function. If the 3D
structures of the interacting protein partners are known, the directionality and distance
dependence of the Δχ-tensor can be exploited to dock the interacting partners in the right
orientation. Pintacuda et al. [21] reported the first use of PCSs to compute the structure of
a protein complex, using as interacting partners the N-terminal domain of the subunits ε and θ
of the E. coli DNA polymerase complex. A recent study used a large PCS data set (446 PCSs) to
characterize cytochrome P450cam in complex with putidaredoxin using a double cysteine anchored
tag [22]. PCS restraints have been incorporated into the protein-protein docking program
Haddock, where the orientation of the interacting partners and the Δχ-tensors are fitted
simultaneously to find optimized interacting surfaces [23].

2.4 Protein Structure Refinement

If the coordinates of the atoms in the protein are known, PCSs can be effectively used to
refine protein structures. Allegrozzi et al. [24] showed that NOE derived structural models
can be further refined using PCSs measured with three different lanthanides (Ce³⁺, Yb³⁺, and
Dy³⁺), which have different coverage ranges over the protein. Adding PCS restraints for the
protein calbindin decreased the overall RMSD relative to the NOE derived NMR structures.
Gaponenko et al. [25] showed that using PCS data generated from three different lanthanide
attachment sites extended the refinement approach to proteins larger than 30 kDa. The PCS
refined structures showed an improvement of more than 1 Å in RMSD compared with NOE only
structures, and this improved accuracy was also validated using RDCs. Other paramagnetic
restraints have been used in a similar manner. Sparse datasets of RDCs combined with sparse
NOEs have been used to identify the best models from a pool of structures generated using
homology modeling [26] and de novo methods [27].

Using PCSs directly for structure calculation is challenging, as one needs to determine the
eight parameters that describe the Δχ-tensor, which are difficult to estimate because they
depend on the chemical environment of the metal. Without knowledge of the 3D coordinates of
the protein, it is not possible to fit the Δχ-tensor to reproduce the experimentally observed
PCSs. However, one can use PCSs as restraints in de novo structure prediction methods such
as Rosetta [28]. Rosetta’s force field accurately describes the protein state, and its
algorithms are designed to robustly search the conformational space accessible to the protein.


3 Protein Structure Determination Using PCS and Rosetta 

Incomplete or sparse structural data generated from NMR experiments
can be used as structural restraints in Rosetta calculations to
facilitate structure determination. Unlike traditional methods,
in which structure calculation depends mainly on the completeness
of the experimental data that define the atomic
coordinates of a protein structure, sparse NMR data are used to
guide the conformational search, directing the sampling
towards the global minimum. Different types of NMR measurements
have been incorporated as additional scoring restraints in
Rosetta. Chemical shift measurements combined with predicted
backbone dihedrals and secondary structure elements can be used
in picking fragments that match the prediction, a procedure known
as CS-Rosetta [29, 30]. Backbone NOEs in combination with
RDCs have also been included in protein structure determination
[28] using an advanced genetic algorithm [31]. Incorporation of
sparse NMR data in structure calculations has been shown to
improve protein structure predictions.

3.1 Rosetta Structure Calculation Algorithm

Based on folding studies of small proteins, Rosetta’s algorithms are built on the assumption
that the ensemble of local structures sampled by a sequence fragment can be approximated by a
small number of local structures that a similar fragment adopts in known protein structures
[32]. For a given protein sequence whose structure is to be determined, the sequence is
decomposed into overlapping windows of nine and three residues. The fragment libraries are
constructed for each of the nine and three residue windows by searching through 3D structure
databases for protein fragments whose sequences or secondary structures have high similarity
to that of the query. The corresponding backbone dihedral angles of the matched protein
fragments are bundled up into fragment libraries. The search for the lowest energy structure
is carried out by assembling the fragments into protein-like structures using Metropolis
Monte-Carlo and simulated annealing algorithms [33]. Starting from a linear polypeptide, the
search is carried out in two distinct phases, a low resolution centroid mode and a high
resolution all-atom mode [34].

Fig. 4 Illustration of Rosetta’s ab initio fragment assembly. (a) A protein decoy in an intermediate state during
fragment assembly. The backbone atoms are shown in a cartoon representation and the side chain atoms are
represented as spheres attached to Cβ atoms. The hydrophobic residues are represented in grey and the solvent
accessible residues in green. (b) Final fold of the protein shown in (a) after the low resolution
fragment assembly. (c) All-atom representation of the final fold of the protein with complete side-chain atoms
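
The decomposition of a query sequence into overlapping windows, described above, can be sketched in a few lines. This only illustrates the windowing step; picking matching fragments from 3D structure databases is done by Rosetta's own tools and is not shown. The function name and the example sequence are hypothetical.

```python
def overlapping_windows(sequence, size):
    """Return (start_index, window) pairs for every window of `size` residues."""
    return [(start, sequence[start:start + size])
            for start in range(len(sequence) - size + 1)]


# Hypothetical 12-residue query sequence, for illustration only.
query = "MKTAYIAKQRQI"
nine_mers = overlapping_windows(query, 9)
three_mers = overlapping_windows(query, 3)
print(len(nine_mers), "nine-residue windows;",
      len(three_mers), "three-residue windows")
print(nine_mers[0], three_mers[-1])
```

A sequence of N residues yields N - 8 nine-residue windows and N - 2 three-residue windows, and a fragment library would hold a set of candidate backbone conformations for each of them.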

3.1.1 Centroid Mode

In this mode, the conformational search is carried out in a low-resolution phase, in which
the amino acid residues are represented in a stripped down version that lacks complete side
chain detail. The side chains (Cβ and beyond) are represented as spheres attached to the
backbone at their centroid point, as shown in Fig. 4a. The fragment assembly follows
Monte-Carlo moves starting from an arbitrary position with a random nine residue fragment
window. For every move, which replaces the coordinates of a protein segment with those of a
fragment from the library, the energy of the resultant protein decoy is evaluated. The
scoring function in the centroid phase is a coarse-grained description built from
probabilistic functions that favor the formation of globular compact structures. This scoring
function explicitly scores for electrostatic and solvation effects among residues, based on
the observed distributions in known proteins. Formation of secondary structural elements in
the folding pathway is encouraged with distinct function terms that favor helix-helix,
helix-sheet, and sheet-sheet pairing. This low resolution centroid mode generates
protein-like decoy structures, in which the polar amino acids are exposed to the solvent
while the hydrophobic residues are buried in the core of the protein (shown in Fig. 4b).
Multiple folding pathways are independently sampled, generating tens to hundreds of thousands
of protein decoy structures to sample the vast conformational space.
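
The Monte-Carlo fragment moves described above can be illustrated with a toy example. The sketch below is not Rosetta's implementation: the conformation is a plain list of numbers, the energy function is a stand-in that measures the squared deviation from a hypothetical target, and the geometric cooling schedule is an assumption. It is only meant to show the Metropolis acceptance rule combined with simulated annealing.

```python
import math
import random


def metropolis_accept(delta_energy, temperature, rng):
    """Metropolis criterion: always accept downhill moves, and accept uphill
    moves with probability exp(-delta_energy / temperature)."""
    if delta_energy <= 0.0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


def assemble(n_positions, fragments, energy, n_moves=2000,
             t_start=2.0, t_end=0.05, seed=0):
    """Toy fragment assembly with a geometric annealing schedule.

    `fragments` is a list of equal-length tuples of backbone-like values and
    `energy` is any callable scoring a conformation (both are stand-ins).
    """
    rng = random.Random(seed)
    conformation = [0.0] * n_positions
    current = energy(conformation)
    best, best_energy = list(conformation), current
    for move in range(n_moves):
        progress = move / max(1, n_moves - 1)
        temperature = t_start * (t_end / t_start) ** progress
        fragment = rng.choice(fragments)
        start = rng.randrange(n_positions - len(fragment) + 1)
        trial = list(conformation)
        trial[start:start + len(fragment)] = fragment  # fragment insertion
        trial_energy = energy(trial)
        if metropolis_accept(trial_energy - current, temperature, rng):
            conformation, current = trial, trial_energy
            if current < best_energy:
                best, best_energy = list(conformation), current
    return best, best_energy


# Hypothetical target and toy energy: squared deviation from the target.
target = [-60.0, -45.0, -60.0, -45.0, -120.0, 130.0, -120.0, 130.0, -60.0]
fragments = [(-60.0, -45.0, -60.0), (-120.0, 130.0, -120.0),
             (130.0, -120.0, 130.0), (-45.0, -60.0, -45.0),
             (-45.0, -120.0, 130.0), (130.0, -60.0, 0.0)]


def toy_energy(conformation):
    return sum((a - b) ** 2 for a, b in zip(conformation, target))


best, best_energy = assemble(len(target), fragments, toy_energy)
print("lowest toy energy found:", best_energy)
```

Running many such trajectories with different seeds mirrors, in miniature, the independent sampling of multiple folding pathways.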

3.1.2 All-Atom Mode

This mode generates complete and optimized placement of side-chain coordinates (shown in
Fig. 4c). Here, side chains are modeled by searching through discrete combinations of amino
acid rotamers by simulated annealing. To further optimize the geometry, multistep Monte Carlo
minimisation is applied to each decoy; steps include torsion angle perturbations,
one-at-a-time rotamer optimization, and continuous gradient based minimisation of backbone
torsion angles and side chain coordinates. The scoring function during this stage is more
detailed, physically realistic, accurate to the atomic level, and computationally expensive.
Hydrogen bonding is explicitly included in the analysis. Hydrogen bonding terms are
knowledge-based terms which are orientation and secondary structure dependent and were
derived from high resolution protein structures. Typically, multiple independent trajectories
are first clustered and atomic details are generated for the desired cluster [33].

3.1.3 PCS Restraints in Rosetta

In the centroid mode, at each instance of a fragment move, Δχ-tensors are fitted to the
assembled structure and PCSs are back-calculated. The difference between the input and
back-calculated PCSs is then used as a quality score to guide assembly to the right fold of
the protein. It has been shown that using PCSs from a single metal center, 3D structures of
proteins of up to 150 amino acid residues can be determined at atomic resolution [35].
However, this method is limited in its application to proteins larger than 150 amino acids.

The primary limitation associated with PCSs measured from a single metal center is the
reduction in quality of the PCS data. Lanthanide tags attached at a single site often fail to
induce significantly large PCSs for most of the spins in the protein. This loss of data is
pronounced in large molecular weight proteins. Secondly, there is additional loss of data due
to the PRE effect induced by the lanthanide ions, where NMR signals of the spins in the
vicinity of the lanthanide are broadened beyond detection.

3.2 Extending PCS Scoring to Multiple Metal Centers

To resolve the ambiguities associated with the PCS data generated from a single metal center
and to achieve complete coverage, the approach has been extended from a single metal center
to multiple metal centers. A second PCS measured for the same nucleus from a lanthanide
attached at a different site restricts the spin to lie on intersecting isosurfaces. A third
PCS measured from a lanthanide attached at a site different from the first two further
restricts the location of the spin in space. This technique, which is analogous to the method
of finding a location on Earth from three or more GPS satellites, is incorporated into the
Rosetta framework and was dubbed GPS-Rosetta [36].


3.2.1 Scoring PCS Data from Multiple Metal Centers in Rosetta

The Δχ-tensor from Eq. (1) can be rewritten as

$$
\mathrm{PCS}_{i}^{\mathrm{calc}} = \frac{1}{12\pi r_{i}^{5}}\,\mathrm{Tr}\left[
\begin{pmatrix}
3x_{i}^{2} - r_{i}^{2} & 3x_{i}y_{i} & 3x_{i}z_{i}\\
3x_{i}y_{i} & 3y_{i}^{2} - r_{i}^{2} & 3y_{i}z_{i}\\
3x_{i}z_{i} & 3y_{i}z_{i} & 3z_{i}^{2} - r_{i}^{2}
\end{pmatrix}
\begin{pmatrix}
\Delta\chi_{xx} & \Delta\chi_{xy} & \Delta\chi_{xz}\\
\Delta\chi_{xy} & \Delta\chi_{yy} & \Delta\chi_{yz}\\
\Delta\chi_{xz} & \Delta\chi_{yz} & \Delta\chi_{zz}
\end{pmatrix}\right] \tag{7}
$$

where r_i is the distance between the spin i and the paramagnetic center M; x_i, y_i, and z_i
are the Cartesian coordinates of the vector between the metal ion and the spin i in an
arbitrary frame f; and Δχ_xx, Δχ_yy, Δχ_zz, Δχ_xy, Δχ_xz, and Δχ_yz are the Δχ-tensor
components in the frame f (as Δχ_zz = −Δχ_xx − Δχ_yy, there are only five independent
parameters). The determination of PCS_i^calc (Eq. 7) poses a nonlinear least-squares fit
problem, which can be divided into its linear and nonlinear parts. PCS_i^calc is linear with
respect to the five Δχ-tensor components, which can be optimized efficiently using singular
value decomposition. With knowledge of the location of the chemical tag used, the search over
the metal coordinates x_M, y_M, and z_M of the paramagnetic center can be carried out on a 3D
grid. The 3D grid is defined by the center of the grid search (cg), the step size between two
nodes (sg), an outer cutoff radius (co) that limits the search to within a given distance of
cg, and an inner cutoff radius (ci) to avoid a search too close to cg [35].
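
The split into a linear and a nonlinear part can be sketched as follows. This is an illustrative reimplementation using NumPy, not Rosetta's code: for each trial metal position on a cubic grid, the five independent tensor components are obtained by linear least squares with `numpy.linalg.lstsq` (an SVD-based solver), and the node with the lowest residual is kept. Function names, the grid construction, and the use of arbitrary consistent units are assumptions.

```python
import itertools

import numpy as np


def design_matrix(spins, metal):
    """Linear coefficients of the five independent tensor components
    (xx, yy, xy, xz, yz) in the matrix form of the PCS equation, using
    zz = -xx - yy."""
    v = np.asarray(spins, dtype=float) - np.asarray(metal, dtype=float)
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    r5 = np.linalg.norm(v, axis=1) ** 5
    coeffs = np.column_stack([x * x - z * z, y * y - z * z,
                              2 * x * y, 2 * x * z, 2 * y * z])
    return coeffs / (4.0 * np.pi * r5[:, None])


def fit_tensor(spins, pcs_exp, metal):
    """Linear part: least-squares tensor components for a fixed metal position."""
    a = design_matrix(spins, metal)
    components, *_ = np.linalg.lstsq(a, pcs_exp, rcond=None)  # SVD-based
    cost = float(np.sqrt(np.sum((a @ components - pcs_exp) ** 2)))
    return components, cost


def grid_search(spins, pcs_exp, center, step, outer, inner=0.0):
    """Nonlinear part: scan metal positions on a cubic grid around `center`,
    skipping nodes outside the outer or inside the inner cutoff radius."""
    center = np.asarray(center, dtype=float)
    n = int(outer // step)
    offsets = np.arange(-n, n + 1) * step
    best = None
    for offset in itertools.product(offsets, repeat=3):
        distance = float(np.linalg.norm(offset))
        if distance > outer or distance < inner:
            continue
        metal = center + np.array(offset)
        components, cost = fit_tensor(spins, pcs_exp, metal)
        if best is None or cost < best[0]:
            best = (cost, metal, components)
    return best  # (cost, metal position, five tensor components)
```

In a full protocol the best grid node would only be the starting point for a subsequent continuous optimization of the metal position and tensor components; that refinement step is omitted here.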

PCSs recorded from multiple lanthanide carrying chemical tags
are given as input to Rosetta by constructing a separate 3D grid
for each tag site. For each PCS dataset per metal and chemical
tag, the Δχ-tensor components are fitted at each node of the 3D
grid and the PCSs are back-calculated. The grid node with the
lowest score obtained from Eq. (8) is then taken as the starting
point to further optimize the metal position and the five components
of the Δχ-tensor to reach the minimum cost for all the metal
centers.

$$
s_{k} = \sum_{q=1}^{m}\sqrt{\sum_{p=1}^{n_{pcs}}\left(\mathrm{PCS}_{calc}^{pq} - \mathrm{PCS}_{exp}^{pq}\right)^{2}} \tag{8}
$$

where m is the number of PCS datasets (one dataset per metal ion)
per binding site k and n_pcs is the number of PCSs in the dataset. A
total weighted sum of square deviations is used as the PCS score
S_total and added to the low-resolution energy function of Rosetta:

$$
S_{total} = \sum_{k=1}^{n} w_{k}\, s_{k} \tag{9}
$$


where n is the total number of metal binding centers and w denotes
the weighting factor relative to the Rosetta ab initio scoring function.
The weighting factor w for each of the n centers was calculated
independently by

$$
w_{k} = \left(\frac{a_{high} - a_{low}}{c_{high} - c_{low}}\right)\Big/\, n \tag{10}
$$

where a_high and a_low are the averages of the highest and lowest 10 %
of the values of the Rosetta ab initio score, and c_high and c_low are the
averages of the highest and lowest 10 % of the PCS score obtained by
rescoring 1000 decoys with unity weighting factor.
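
The scoring and weighting scheme described above can be sketched in plain Python. Two points are assumptions on my part rather than details confirmed by the text: the per-site score is taken as the root of the summed squared deviations for each dataset, added up over datasets, and the weight is divided by the number of metal centers. All function names are hypothetical.

```python
import math


def site_score(pcs_calc, pcs_exp):
    """Per-site PCS score: for each dataset (metal ion), the root of the summed
    squared deviations, added up over datasets (assumed form of the score)."""
    return sum(math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, exp)))
               for calc, exp in zip(pcs_calc, pcs_exp))


def tail_means(values, fraction=0.10):
    """Average of the highest and of the lowest `fraction` of the values."""
    ordered = sorted(values)
    count = max(1, round(fraction * len(ordered)))
    return sum(ordered[-count:]) / count, sum(ordered[:count]) / count


def pcs_weight(rosetta_scores, pcs_scores, n_centers):
    """Weight for one metal centre from decoys rescored with unit weight.

    The division by n_centers is an assumption about the weighting formula.
    """
    a_high, a_low = tail_means(rosetta_scores)
    c_high, c_low = tail_means(pcs_scores)
    return (a_high - a_low) / (c_high - c_low) / n_centers


def total_pcs_score(site_scores, weights):
    """Weighted sum of the per-site scores, to be added to the Rosetta energy."""
    return sum(w * s for w, s in zip(weights, site_scores))


# Hypothetical decoy scores, for illustration only.
rosetta_scores = [float(v) for v in range(100)]
pcs_scores = [2.0 * v for v in range(100)]
print(pcs_weight(rosetta_scores, pcs_scores, n_centers=3))
```

The purpose of the weight is to put the spread of the PCS score on the same scale as the spread of the Rosetta score, so that neither term dominates the combined energy.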


4 The GPS-Rosetta Algorithm 

The algorithm incorporating PCS scoring from multiple metal
centers in Rosetta’s structure determination protocol is described
as a flow chart in Fig. 5. The Δχ-tensors for each dataset from
multiple sites are simultaneously optimized and the weighted PCS
scores for individual metal sites are added to the centroid scoring
function. Side chain atoms are then added to all the structural
decoys and scored using Rosetta’s all-atom scoring function. The
PCS scoring is not used in this mode, because only minor changes
in the backbone structure are generated. The side chain optimized
structural models are rescored with PCS data from multiple metal
centers with new weights generated using Eq. (10), except that they
are now weighted against Rosetta’s all-atom scoring function. The
top structures are selected based on the lowest combined scores of
Rosetta’s all-atom score and the weighted PCS score from all the tag
sites.

The GPS-Rosetta protocol has been used to determine
3D structures from PCS data generated in two different NMR
experiments, solution state NMR and magic angle spinning (MAS)
solid-state NMR. The structure of the C-terminal domain of endoplasmic
reticulum protein 29, ERp29-C (106 residues), from rat was
determined from PCS data generated at four different metal centers in
the solution state, and that of the immunoglobulin binding domain of protein G,
GB1 (56 residues), from Streptococcus spp. was determined from PCS
data generated at three different metal centers in the microcrystalline
state.

4.1 Fold Determination Using PCSs from Solution NMR Experiments

ERp29-C is a chaperone protein expressed in the endoplasmic
reticulum of mammalian cells, where it facilitates the folding and
transport of other protein molecules. The 3D structure was first
determined by solution NMR using a conventional NOE approach,
and the result is referred to as the NOE structure [37]. However,
the crystal structure of human ERp29-C [38] shows a significantly
different fold, with a Cα root mean squared deviation (RMSD) of
4.5 Å when compared to the NOE structure. The GPS-Rosetta protocol
was employed to reassess the structure in solution [36]. Four
different sites on the protein were chosen to bind two different
lanthanide tags. The cysteine-ligated C1 tag [39] was chosen to
bind at the native cysteine (C157), and the IDA-SH tag [40] was
attached at the double mutants S200C/K204D, A218C/A222D,
and Q241C/N245D. All the double mutations were on α-helices,
with the aspartate residue at the (i + 4)th position forming a
specific lanthanide binding site. The side chain carboxyl oxygen of
the aspartate served as an additional coordination site to immobilize
the lanthanide ion. The PCS dataset from eight paramagnetic
samples is composed of a total of 212 PCSs measured using the
lanthanides Tb³⁺, Tm³⁺, and Y³⁺, where Y³⁺ served as the diamagnetic
reference.

Fig. 5 Flowchart illustrating the series of steps involved in running the GPS-Rosetta protocol. (a) Short nine- and
three-residue fragments are generated based on the target sequence and on secondary structure prediction from
backbone chemical shifts. (b) PCS weights are calculated using Eq. (8). (c) Centroid models are generated by
fragment assembly following the Metropolis Monte Carlo sampling algorithm. PCS scores for individual tag sites
are independently optimized and the PCS scores are added to Rosetta’s scoring function. (d) Side chain
generation and optimization of centroid models. (e) PCS scores for individual tag sites are reweighted from
all-atom models. (f) Models are rescored with PCSs and Rosetta’s all-atom scoring function. (g) The final
structure is selected based on the lowest combined score value

The unique coordination feature of IDA-SH enabled determination
of the position of the metal ion at 5.9 Å from the Cα of the
(i + 4)th residue, lying on a vector that joins the backbone amide
nitrogen at (i + 6) and the Cα of the (i + 4)th aspartate. The
lanthanide position defined by the C1 tag at C157 was dynamically
optimized during the folding simulation. More than 100,000 all-atom
models were generated using the GPS-Rosetta protocol, and multiple
structures satisfying the combined Rosetta and PCS score and the
experimental data were selected. The final structure was the model
with the lowest combined Rosetta all-atom and weighted PCS energy.
This structure, which is represented by the red point in Fig. 6a,
has a backbone Cα RMSD of 2.4 Å to the crystal structure
(Fig. 6b) [PDBID: 2QC7; [38]] and is referred to as the GPS-Rosetta
model. The five structures with the lowest PCS RMSD are shown as
blue points, and the top five models with an arbitrary low combined
score and low PCS RMSD are represented as green points (Fig. 6a).
The GPS-Rosetta structure was compared against the crystal
structure and the top 10 selected structures. Superimposed
structures with low PCS RMSD are represented in shades of blue
(Fig. 6b), and those with low combined PCS and Rosetta energy and
low PCS RMSD are represented in shades of green (Fig. 6c). The
Cα RMSD of all the selected structures lies in the range 2.0-2.9 Å
to the crystal structure, with the exception of small variations in
the orientation of the C-terminal residues, which were reported to be
disordered [37]. The GPS-Rosetta structure, in red (PDBID:
2M66), clearly resembles the crystal structure (Fig. 6b, RMSD of
2.4 Å) more closely than it resembles the NOE structure (Fig. 6d,
RMSD 6 Å), effectively overruling the NOE structure.

4.2 High Resolution Protein Structure Determination Using PCSs from MAS Solid State NMR Experiments


MAS solid-state NMR spectroscopy has been routinely employed
to determine the structures of membrane biomolecules and proteins
that are difficult to study by solution NMR or X-ray crystallography
[41]. 3D structures are determined by resolving a large number of
dipolar couplings between ¹H, ¹³C, and ¹⁵N nuclei [42, 43];
however, the spectra contain densely packed cross-peaks
that are highly difficult to assign. Moreover, the peaks arising
from long-range correlations in dipolar couplings have a low
signal-to-noise ratio, and the time required to acquire a 2D spectrum
is several days [44, 45].

Fig. 6 Structure determination using the GPS-Rosetta protocol for ERp29-C. (a) Combined score of weighted PCS
and Rosetta energy plotted against the PCS RMSD for each of the 100,000 generated structures. The final
selected structure, represented in red, has the lowest combined score. Structures with the lowest PCS RMSD are
represented in blue, and the models with an arbitrary low combined score and low PCS RMSD are represented
in green. (b) Superimposed cartoon representations of top structures selected using the GPS-Rosetta protocol.
The crystal structure [PDBID: 2QC7] is shown in grey, and the GPS-Rosetta structure, represented in red, has a
Cα RMSD of 2.4 Å to the crystal structure (residues 158-228 and 230-244). The top five models with low PCS
RMSD, represented in shades of blue, have a Cα RMSD range of 2.0-2.9 Å to the crystal structure (residues
158-228 and 230-244). (c) The top five models with low PCS and Rosetta energy and also low PCS RMSD,
represented in shades of green, have a Cα RMSD range of 2.2-2.6 Å to the crystal structure (residues
158-228 and 230-244). (d) The NOE structure [PDBID: 1G7D], represented in yellow, has a Cα RMSD of 6 Å to
the GPS-Rosetta structure (residues 158-244), represented in red

Here we demonstrate the use of PCSs recorded in the solid state
for structure calculation. The GB1 protein (56 amino acids)
served as a model system. GB1 was covalently ligated to the
4-mercaptomethyl-dipicolinic acid (4MMDPA) tag [46] at three
different sites by generating the three cysteine mutants K28C, D40C,
and E42C. The tags were loaded with the paramagnetic metal ions Co²⁺,
Yb³⁺, and Tm³⁺, while Zn²⁺ and Lu³⁺ served as diamagnetic
references. A total of 244 PCSs were measured from five paramagnetic
datasets [47]. Because GB1 is a small protein, a stripped-down
version of the GPS-Rosetta protocol was employed. Three
independent Rosetta simulations were carried out, one for each
mutant, with nonhomologous fragment libraries. Around 4500, 8400, and
10,000 all-atom models were generated for the three mutants. To take
advantage of all three datasets for GB1, Rosetta’s all-atom
structures for each of the mutants were rescored using the
GPS-Rosetta protocol, and the final structures were selected based
on low Rosetta energy and low combined PCS score from all three
datasets (Fig. 7a). The lowest combined energy structure was found
to have an RMSD of 0.7 Å, i.e., atomic resolution, when superimposed
on the crystal structure (Fig. 7b) [PDBID: 1PGA, [48]].

Fig. 7 Structure determination using the GPS-Rosetta protocol with MAS-NMR PCSs. (a) Combined score of PCS
energy from three tags and Rosetta energy versus the RMSD to the crystal structure of GB1 [PDBID: 1PGA,
[48]]. Sampling from K28C is represented in red, D40C in green, and E42C in blue. (b) 3D superpositions of
models calculated using GPS-Rosetta. The crystal structure of GB1 is represented in grey, mutant K28C in red,
D40C in green, and E42C in blue. The three lowest scored structures have an RMSD to the crystal structure of
0.9, 0.7, and 1.1 Å, respectively

The GPS-Rosetta protocol along with demonstration tutorials 
is available for download with the current Rosetta release. 


5 Conclusion 


Here the success of the GPS-Rosetta protocol in determining 3D
structures using PCS data from multiple tags, from both solution and
solid-state NMR experiments, has been demonstrated. This method
offers great promise for resolving the structures of large proteins.
PCSs are obtained from simple ¹⁵N-HSQC measurements, which are
highly accurate and sensitive compared to traditional NOE
measurements, and versatile PCS datasets can be generated by swapping
a diverse range of available paramagnetic metals, metal-carrying
tags, and peptide sequences.

In computational modeling, incorporation of PCS data as
structural restraints has enabled the otherwise computationally
intractable conformational space to be explored in finite time.
Inaccuracies in molecular force fields have always posed a challenge
in identifying the native protein fold among well-formed structural
decoys, and PCSs, being long range in nature, effectively
discriminate native from nonnative folds. PCSs can also be
complemented with other sparse restraints such as RDCs, PREs, and
NOEs, enhancing the structural information that can be efficiently
exploited in computational modeling. In conclusion, the hybrid
approach of incorporating experimental PCSs into structure
determination algorithms offers a more efficient alternative for
solving protein structures than traditional methods.


Abstract

Recent technological advances in sequencing and high-throughput DNA cloning have resulted in the 
generation of vast quantities of biological sequence data. Ideally the functions of individual genes and 
proteins predicted by these methods should be assessed experimentally within the context of a defined 
hypothesis. However, if no hypothesis is known a priori, or the number of sequences to be assessed is large, 
bioinformatics techniques may be useful in predicting function. 

This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA and 
peptide sequences of unknown function. Accumulated information obtained during each step of the 
pipeline is used to build a testable hypothesis of function. 

The following methods are described in detail: 

1. Annotation of gene function through Protein domain detection (SMART and Pfam). 

2. Sequence similarity methods for homolog detection (BLAST and DELTA-BLAST). 

3. Comparing sequences to whole genome data. 

Key words Comparative genomics, Homology, Orthology, Paralogy, BLAST, Protein domain, Pfam, 
SMART, Ensembl, UCSC genome browser 


1 Introduction 

This chapter describes an analysis pipeline comprised of freely 
available bioinformatics sequence comparison tools that can be 
used to infer potential function from protein-coding DNA and 
peptide sequences (Fig. 1). 

1.1 What Is Homology?

In a biological context, homology is defined as the existence of
shared ancestry between a pair of structures in different species,
either by descent or recombination. The central thesis for inferring
related function from sequence data is that if two or more genes
have evolved slowly enough to allow detection of statistically
significant sequence similarity, common ancestry (homology)
between the genes can be inferred. This follows from the assumption
that the most parsimonious explanation of sequence similarity
is conserved ancestry rather than convergent evolution [1, 2].

Fig. 1 Analysis pipeline for inference of function. Schematic representation of analysis procedure for inference
of function by similarity methods

Fig. 2 Homology and paralogy. Dotted lines represent the relationship of species 1, species 2, and species 3,
separated by two speciation events. Genes are represented by filled circles. All genes represented are
homologous because they have descended from a single gene in the last common ancestor. Definition of
genes as orthologs or paralogs depends on their shared ancestry. If the genes in question are most recently
related by a gene duplication event they are termed paralogs, whereas if the genes are most recently related
by a speciation event, they are termed orthologs. If the gene duplication is an intra-genome event, occurring
following speciation, the genes are further defined as in-paralogs. If the duplication is prior to a speciation
event they are termed out-paralogs [1, 5]. (a) The intra-genome duplication within species 2 has resulted in
a pair of in-paralogous sequences B1 and B2. Both B1 and B2 are orthologous to A1 and all genes are
orthologous to C1. (b) For a different set of genes, a gene duplication prior to the speciation of species 1 and
2 results in a single copy of each duplicated gene being retained in both species. As a result genes A3 and B4
are termed out-paralogs, as are genes A4 and B3. Genes A3 and B3 share an orthologous relationship, as do
A4 and B4

While the possession of sequence similarity is indicative of
underlying structural similarity, it may not always imply conserved
function [2]. Homologous genes that are related by a speciation
event are termed orthologs, whereas those related by gene
duplication events are termed paralogs (Fig. 2) [3-6]. The functions
of orthologous genes tend to be fairly conserved; therefore, high
sequence similarity between a gene of unknown function and a
detected ortholog will often indicate conserved function [7]. In
contrast, paralogous genes that have undergone a duplication event
may either retain different but related roles (subfunctionalization)
or rapidly diverge and undertake new roles (neofunctionalization)
[5, 8].
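
The ortholog/paralog definitions can be made concrete with a short
Python sketch. It assumes a gene tree in which every internal node is
already labeled as a speciation or duplication event; the nested-tuple
encoding and the function names are illustrative choices, and the tree
below is a hand-coded version of the scenario in Fig. 2a.

```python
# Internal node: (event, left_subtree, right_subtree); leaf: gene name.
TREE_2A = (
    "speciation",
    ("speciation", "A1", ("duplication", "B1", "B2")),
    "C1",
)


def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node last shared an ancestor here.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out
```

Calling classify_pairs(TREE_2A) labels B1 and B2 as paralogs and every
other pair as orthologs, matching the caption of Fig. 2a.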

Because the identification of sequence similarity between two
or more paralogous genes may not be indicative of conserved
function, tools that compare sequences in this manner should be
used with caution (see Note 1). It is recommended that the results
generated by these tools be augmented with additional
information when formulating hypotheses concerning function.

Many proteins contain functional units known as domains. In
comparative analysis, a domain constitutes a region of conserved
sequence between different proteins. These may equate to
functional units of proteins, and often encompass a hydrophobic core
[9, 10]. Thus, domains can be thought of as the building blocks of
protein functionality, and hence the possession or indeed lack of a
protein domain, and the architecture of domains within a given
protein, will aid in the assessment of homology predictions and
reduce the chances of incorrectly assigning function [11, 12].


2 Materials 


The tools described in the analysis pipeline are listed in Table 1
(see Subheading 3.2.5 for additional methods of interest).


Table 1
Pipeline tools

| Category | Tool | URL | Description |
|---|---|---|---|
| General | EBI Toolbox | http://www.ebi.ac.uk/services | Links to tools and analysis packages |
| General | ExPASy Server | http://www.expasy.org/ | Links to tools and analysis packages |
| General | COGs | http://www.ncbi.nlm.nih.gov/COG/ | Clusters of orthologous genes from multiple archaeal, bacterial, and eukaryotic genomes [7, 8] |
| General | Ensembl | www.ensembl.org | Whole genome annotation [13, 14] |
| General | UCSC Genome Browser | genome.cse.ucsc.edu | Whole genome annotation [15] |
| Domain identification tools | CDD | http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml | Conserved domain database. Options to search Pfam, SMART, and COG databases [16, 17] |
| Domain identification tools | Interpro | http://www.ebi.ac.uk/interpro/search/sequence-search | Multiple databases of protein families and domains (includes Pfam, SMART, PRINTS, Prosite, etc.) [18, 19] |
| Domain identification tools | Pfam | http://pfam.xfam.org/ | Library of protein domain HMMs [20, 21] |
| Domain identification tools | SMART | http://smart.embl-heidelberg.de/ | Library of protein domain HMMs [22, 23] |
| Similarity tools | FASTA | http://www.ebi.ac.uk/Tools/sss/fasta/ | Local alignment search tool [24] |
| Similarity tools | NCBI-BLAST | http://blast.ncbi.nlm.nih.gov/Blast.cgi | Local alignment search tool at the NCBI [25, 26] |


3 Methods

In this chapter, we propose a pipeline that can be used to accurately
categorize the functions of biological sequences.


3.1 Analysis Pipeline Step 1: Domain Identification

It is often assumed that the obligatory first step when
investigating an unknown sequence is to perform a BLAST or FASTA
search. If a single high significance hit to a closely related species is
detected (and the alignment extends to the full length of both
sequences) then it may be safe to assume that a true ortholog has
been detected. However, if a partial alignment is reported, or if the
results indicate similarity to more than one protein, the output may
be more difficult to interpret. We therefore recommend that
users conduct domain searches prior to sequence alignments.


3.1.1 Tools for Identifying Protein Domains

By definition, protein domains are conserved and, although they can
appear within genes in different combinations, they are rarely
fragmented [10]. Traditionally, members of domain families are
compared using hidden Markov models (HMMs) [27]. These HMMs
predict the probability that specific residues will occur at each
position within a domain based on the level of conservation across
the domain family. This method has been widely used in
investigations into gene families and in the annotation of whole
genomes [28-31]. When developing hypotheses, a researcher should keep
in mind that the presence or absence of protein domains only
partially defines protein function. Other biologically relevant
factors such as the co-occurrence of domains, domain-protein
interactions, and protein localization (both cellular and subcellular)
should be considered when formulating hypotheses [10, 12, 32].

Two tools that are widely used to compare query sequences to
precomputed libraries of HMMs are Pfam [21, 33] and SMART
[22, 23].

Pfam (release 30.0) contains profile-HMMs of 16,306 protein
domains [21]. It links detected domains to a database describing
their taxonomic abundance, potential evolutionary origins, and
relationships (via Pfam clans) [34]. In contrast, the latest update
to version 7.0 of the SMART database contains fewer HMMs
(1,200 domains) but offers the option to include the Pfam HMM
library within a search [22]. Like Pfam, the SMART database gives
extensive details of function, evolution, and structure. It also
provides links to relevant literature and information on proteins
from 1133 completely sequenced genomes (choose “Use SMART in Genomic
mode” at the start page), and highlights inactive domains if key
functional residues differ between the query and target sequences.
Both Pfam and SMART databases can be searched independently
or via metasearch tools such as CDD [16, 17] or Interpro [18, 19].

(a) Domains detected by SMART

| Name | Begin | End | E-value |
|---|---|---|---|
| Low complexity | 2 | 18 | - |
| Pfam: Pep_M12B_propep | 61 | 173 | 2.40e-49 |
| Pfam: Reprolysin | 178 | 375 | 4.70e-83 |
| DISIN | 393 | 468 | ?.50e-41 |
| ACR | 469 | 607 | 6.50e-53 |
| EGF | 613 | 643 | 2.80e+00 |
| Transmembrane | 683 | 705 | - |
| Low complexity | 722 | 735 | - |

(b) Domains detected by Pfam

| Domain | Start | End | E-value |
|---|---|---|---|
| Pep_M12B_propep | 61 | 173 | 1.8e-50 |
| Reprolysin | 178 | 375 | 3.6e-84 |
| Disintegrin | 393 | 468 | 1.8e-37 |
| ADAM_CR | 470 | 587 | 2.3e-48 |

Fig. 3 Domain detection by Pfam and SMART. Graphical and textual representations of domains detected in an
ADAM 2 precursor from Cavia porcellus (Swiss-Prot accession Q60411). (a) Domains detected by SMART
version 7.0 with the additional parameters Pfam domains, signal peptides, and internal repeats selected [22,
23]. (b) Domains detected by Pfam version 27.0 [21, 33]. A global and fragment search was conducted with
the SEG low complexity filter on and an E-value cutoff of 1.0. Significant Pfam-B domains are not shown


Queries of Pfam and SMART with an ADAM 2 precursor from
Cavia porcellus (Swiss-Prot accession Q60411) identify several
domains with significant E-values (see Subheading 3.2.1 for a
description of E-value statistics). Both tools indicate the positions
and architecture of domains present within the query sequence as
well as providing the user with information regarding domain
function, co-occurrence, evolution, and residue conservation
(Fig. 3). For example, SMART links the ACR (ADAM Cysteine-Rich
Domain) to an Interpro abstract that informs the user of the
function and domain architecture of ADAM proteins, while the
evolution section displays the abundance of the ACR domains
within the database (565 domains, 563 proteins, all metazoa).
Similarly, the Pfam annotation of the Reprolysin domain provides
information regarding domain function. Figure 3 highlights the
overlap of both these methods. Slight discrepancies are evident;
these reflect the different lengths of the HMMs contained within
each of the databases. Users are therefore encouraged to use
multiple applications and consolidate results to achieve a consensus.
Armed with this information, users can begin to build a hypothesis
of sequence function (see Note 2).


3.2 Analysis Pipeline Step 2: Detection of Homologs

Sequence comparison tools aim to accurately infer homology from
truly related biological sequences. Nucleotide alignment
algorithms only look for direct similarities between sequences in base
space and do not account for factors such as the class of amino acid
or the relative abundance of amino acid types. Protein alignment
algorithms integrate these factors to improve the accuracy of
alignments based on the likelihood of amino acid substitutions. For
example, a conservative substitution, such as an isoleucine for a
valine (both possess aliphatic R groups), would score more
favorably than a substitution involving a rare amino acid such as
tryptophan or cysteine. A number of schemes have been developed to
weight all of the possible amino acid substitutions as matrices. The
most commonly used examples are PAM (percent accepted mutation)
[35] and BLOSUM (Blocks Substitution matrix) [36]. PAM matrices
are based on an evolutionary model of accepted point mutations,
whereas BLOSUM matrices are based on empirical datasets of aligned
sequences. The suffix of a BLOSUM matrix denotes the maximum
percentage similarity of the sequences in the alignments used to
build it. Thus, the scores in BLOSUM45 and BLOSUM80 are
generated from sequences clustered at 45 % and 80 % similarity,
respectively (see Note 3). Equipped with these substitution matrices,
various algorithms are available to align sequences in such a way as
to maximize the overall alignment score. Algorithms that produce a
guaranteed optimal local alignment include the Smith-Waterman
algorithm [37]. Due to their computational requirements, such
methods are often impractical for large datasets. To accelerate
identification of the most significant alignments, heuristic
algorithms such as BLAST [25, 38] and FASTA [24] have been
developed.


3.2.1 Detection of Homologs by BLAST
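
As background to the heuristic tools, the optimal local alignment
that they approximate can be sketched in a few lines of Python. This
is a minimal Smith-Waterman implementation under simplifying
assumptions: a toy scoring scheme (match +2, mismatch -1) stands in
for a PAM or BLOSUM matrix, and gaps carry a fixed linear penalty
rather than the affine penalties used by BLAST. The function name and
defaults are illustrative.

```python
def smith_waterman(a, b, score=lambda x, y: 2 if x == y else -1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                        # start a new alignment
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                H[i - 1][j] + gap,                        # gap in b
                H[i][j - 1] + gap,                        # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

The matrix has one cell per pair of residues, so the cost grows with
the product of the two sequence lengths, which is why exhaustive
searches of large databases motivated heuristics such as BLAST and
FASTA. A substitution matrix can be supplied through the score
argument, for example as a function that looks up a pair of residues
in a dictionary.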

The most widely used sequence comparison tool is BLAST [25, 38],
and the NCBI version of BLAST is probably the most commonly
used variation. It can be used online via the NCBI web interface
(blast.ncbi.nlm.nih.gov/Blast.cgi) or downloaded and run locally as a
stand-alone tool (BLAST+). This tutorial focuses on the NCBI web
interface; for a more detailed description of BLAST+ see [39]. Many
of the nuances of BLAST and detailed descriptions of the statistics
will not be discussed here but are covered in detail elsewhere
[25, 26, 40-44]. A particularly thorough explanation is given in
[39]. Also refer to the BLAST Help pages
(http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs),
tutorials (http://www.ncbi.nlm.nih.gov/books/NBK1734/), and see
Note 4.

The basic options required for a BLAST search are a query
sequence, a database to search, the type of search, and the search
parameters. The query sequence can be entered as plain text, as
a valid NCBI sequence identifier, or as a FASTA-formatted sequence
file in which the first line (containing identifier information) is
marked by an initial greater than symbol (>) and is followed by the
sequence on subsequent lines. It is good practice to create
FASTA-formatted sequence files, as the identifiers are reported in
the BLAST output, which helps when tracking multiple search results.
The database searched will relate to the hypothesis of the user’s
experiment and may have implications for the test statistics (see
Note 5). NCBI-BLAST has access to 7 protein and 16 nucleotide
databases. For an initial search when identifying potential
homologs it is best practice to search one of the nr databases. These
contain nonredundant (nonidentical) entries from GenBank
translations, RefSeq Proteins, PDB, SwissProt, PIR, and PRF databases.
If a species of interest is known, then the database can be filtered by
organism using a taxon id code (http://www.ncbi.nlm.nih.gov/
taxonomy/) or from predefined taxonomic groups, for example,
primate or eukaryote.
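
The FASTA layout described above is simple enough to read with a few
lines of Python. The sketch below is a minimal parser for illustration;
the function name is hypothetical, and it keeps only the first word of
each header line as the identifier.

```python
def parse_fasta(text):
    """Map each sequence identifier to its sequence string."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: identifier is the first word after ">".
            header = line[1:].split()
            identifier = header[0] if header else ""
            records[identifier] = []
        elif identifier is not None:
            # Sequence line: sequences may span several lines.
            records[identifier].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


EXAMPLE = """>query1 hypothetical protein
MKTAYIAK
QRQISFVK
>query2
GGSGGS
"""
```

Here parse_fasta(EXAMPLE) yields two records keyed by "query1" and
"query2", with the wrapped lines of the first sequence joined together.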

It is recommended that protein or translated nucleotide
sequences are used when conducting searches to infer function. If
a protein sequence is available it is best to search in protein space
using blastp. If a DNA sequence is available it is best to ignore
blastn (which searches in nucleotide space) and use either blastx or
tblastx. In the program selection section one can observe that
multiple BLAST algorithms are available; these are described in
Table 2. Additional search settings are shown in the “Algorithm
parameters” tab. Of specific note are the expect score (E-value) and
the low complexity filter. The E-value is the statistical significance
threshold for reporting matches against the database. The default
value is 10. This indicates that ten matches with a similar score
are expected to be found merely by chance. The lower
the E-value the more significant the alignment [25]. For example, an
E-value of 1 × 10⁻³ indicates that the likelihood of a match
occurring by chance is 1 in 1000 [45]. In the filters and masking
section there is a filter to exclude low complexity regions. As these
regions are likely to result in alignments of statistical but not
biological significance, the filter should always be turned on unless
the user is confident in their hypothesis. The default filtering
algorithms are SEG masking for protein searches [46] and DUST
(Tatusov and Lipman, unpublished) for translated nucleotide searches.
Other parameters that can be modified in the Algorithm parameters
section include the
word size (the number of matching residues required to seed an
alignment extension algorithm) and the scoring parameters (the
reward and penalty for matches, mismatches, and gaps in the
alignment).


Table 2
Search parameters and common uses of NCBI-BLAST variants

| Program | Query | Database | Search type | Algorithms | Common uses |
|---|---|---|---|---|---|
| blastn | DNA | DNA | DNA-DNA | Highly similar, more dissimilar, somewhat similar | Search for near identical DNA sequences, confirmation of DNA sequencing experiment. Compare query to genomic DNA to identify splicing patterns |
| blastp | Protein | Protein | Protein-Protein | blastp, PSI-BLAST, PHI-BLAST, DELTA-BLAST | Search for homologous protein sequences. Annotation of genes of unknown function. Searches can be direct (blastp), use profile models (PSI-BLAST, PHI-BLAST), or database-assisted profile models (DELTA-BLAST) to improve sensitivity |
| blastx | DNA | Protein | Translated DNA-Protein | N/A | Gene finding within DNA sequences |
| tblastx | DNA | DNA | Translated DNA-Translated DNA | N/A | Identify protein coding structures in DNA sequences |
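
The relationship between alignment score, search space, and E-value
can be sketched numerically. The Python snippet below uses the
standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the
bit score, m the query length, and n the database length; it ignores
the edge-effect (effective length) corrections that BLAST applies, so
the values will not match BLAST output exactly. The function names are
illustrative.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, assuming Poisson statistics."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E
```

Two properties follow directly: each additional bit of score halves
the E-value, and the same bit score becomes less significant as the
database grows. For small E-values the probability and the E-value are
nearly equal, which is why an E-value of 1 × 10⁻³ can be read as a
chance of roughly 1 in 1000.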

For our search we will keep these parameters at their default
settings. Clicking the “BLAST” button will submit your search. A
new status window will open that automatically updates until the
search is complete and the results page appears. The BLAST results
page consists of a header containing information on the query and
database searched, a graphical representation of the alignments, a
summary of each significant hit, and a footer containing details of
the search statistics. The graphical summary displays the significant
hits (colored according to the degree of similarity) to the query
sequence (at the top of the graphic). This view gives the user ready
information regarding the region(s) of the query sequence that
produce significant hits. The one-line output summary ranks each
hit by E-value. Each hit is hyperlinked to a corresponding Entrez
database entry containing links to associated genes, structures,
taxonomy, and publications. Scrolling down or clicking on an
individual score will show the alignments. Each aligned region
(known as a high scoring segment pair or hsp) has a header with
gene identifier and score summary. The bit score “Score = x bits” is
determined from the raw scores of the alignment as defined in the
substitution matrix. From this the E-value “Expect = x” is
calculated. The alignment section also visualizes the hsp to show
the specific similarities and differences between the query and
subject sequences. Sandwiched between these sequences, identical
matches are highlighted by the corresponding amino acids and
conserved matches by a plus sign. This section of the page also links
to additional resources including gene information, Map Viewer (for
genomic localization), and lists of known identical homologs.


3.2.2 Assessing the Results of a BLAST Search

Confidence can be placed in a homology assignment if all relevant
domains are conserved, form the same architecture, and key residues
known to be important for function are shared in the correct spatial
context. Thus, hsps should always be critically assessed using
information determined from domain searches. When viewing an
alignment, users should pose the question: “do the start and end
positions of the hsp correspond to a predicted domain?” If so,
“do aligned residues correspond to critical residues described in
the domain annotation?” If conserved functional residues are not
aligned, then caution should be exercised in assigning function.
Alignments should also be checked for residue bias that has escaped
the low-complexity filters. Certain proteins, for example myosins
(which are rich in lysine and glutamic acid), have inherent
compositional biases that can affect alignment scores. When
investigating such sequences, users should assess corresponding
residues to check whether the significance of the alignment is due to
both protein types sharing a common bias rather than a common
function.


3.2.3 Detection of More Distant Homologs

The BLAST method identifies homologs by comparing sequences
directly. As a result it will undoubtedly miss or misalign homologs
with more divergent ancestry, where greater sequence change is
expected. In such cases, more powerful methods are required.
When viewing the output of an alignment, it can be observed that
some regions are highly conserved whereas others accept greater
numbers of substitutions. This information can be translated into
profile-sequence models and used to guide alignments.

Profile-sequence models are weighted based upon the number
of expected matches and mismatches within a given region of
sequence. As a result, mismatches in conserved areas are penalized
to a greater extent than mismatches in regions of high
variability. NCBI offers two profile-sequence model methods for
predicting more distant homologs: PSI-BLAST (position-specific
iterated BLAST) and DELTA-BLAST (domain enhanced lookup
time accelerated BLAST). Both methods utilize position-specific
score matrix (PSSM) profile-sequence models [26, 40, 42]. A
PSSM is an L × 20 matrix of scores, where L is the
length of the query sequence and 20 represents each possible amino
acid. Each position is weighted according to its
conservation within the multiple alignment. Conserved positions are
assigned high scores whereas regions of high variability are assigned
scores close to zero [47, 48]. The basis of the PSSM approach is
outlined in more detail in Fig. 4.

Fig. 4 Detection of homologs by PSI-BLAST and DELTA-BLAST. (a) Hypothetical sequences A-F are distantly
related homologs. Their unknown relationship and similarity are represented by distance on the stylized
phylogenetic tree. Initiating a standard gapped BLAST [25, 26] search with sequence A, of unknown function,
would result in identification of similar sequences B and C. If no other sequences were identified we would have no
functional information with which to annotate sequence A. However, if a PSSM approach is used, the additive
information of sequences A, B, and C will allow for the detection of sequence D and subsequently of functionally
annotated sequences E and F in later iterations of the algorithm. The BLAST methodology means that if
sequences A and D are homologous, as are sequences D and E, it follows that A and E must also be
homologous, allowing annotation of the initial query sequence. (b) Schematic representation of an alignment of
sequences A-F. Aligned domains share a color, whereas unaligned regions are represented by open boxes. To
correctly annotate a gene with functional information, the alignments described must occur in the same region
of the alignment. Therefore, while sequences E and F are related by the presence of the solid black domain, its
function may not be reflected in sequences A-D as these sequences do not contain this domain
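
The L × 20 layout of a PSSM can be illustrated with a deliberately simplified Python sketch. It is not the PSI-BLAST procedure: it assumes a uniform background frequency for all 20 amino acids and a fixed pseudocount, and it ignores the sequence weighting and data-dependent pseudocounts that PSI-BLAST applies [40]. The function names `build_pssm` and `score_sequence` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return a list of L rows, each mapping 20 amino acids to a log-odds score (bits).

    alignment is a list of equal-length aligned sequences; gap characters
    and any non-standard symbols are skipped when counting.
    """
    length = len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)  # simplifying assumption: uniform background
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment if seq[column] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence of length L."""
    return sum(pssm[i][aa] for i, aa in enumerate(sequence))


# Toy alignment: column 0 is fully conserved, columns 1 and 2 are variable.
toy_alignment = ["ACD", "AGE", "AHK", "AIL"]
toy_pssm = build_pssm(toy_alignment)
```

In this toy case the conserved first column gives alanine a clearly positive score, whereas every score in the variable columns stays within one bit of zero, which mirrors the behavior described in the text.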

PSI-BLAST first compares the query sequence to a defined
database using the standard gapped BLAST algorithm [26, 43].
From this initial search, significant matches (the NCBI default
E-value threshold is 0.005) are selected and a multiple sequence
alignment is generated (matches identical to the query sequence or
>98 % identical to another match are purged to avoid redundancy). A
PSSM is generated and used to seed a new alignment. Any significant
hits are added to the multiple sequence alignment and the
process is repeated in an iterative manner until convergence occurs
(no new sequences with significance below the E-value cutoff are
detected) (see Note 6). Using PSSMs in this way results in a wider
search of the sequence space, improves sensitivity, and incorporates
more distant homologs. In contrast, DELTA-BLAST utilizes a
precomputed database, the NCBI Conserved Domain Database
(CDD), to guide the initial PSSM model [42]. This resource was
developed to identify conserved domains within protein sequences
and includes manually curated domain models (which have been
refined using protein 3D structures), as well as models constructed
from clusters of related sequences. After the initial alignment step,
DELTA-BLAST proceeds using the same iterative PSSM model as
PSI-BLAST.

3.2.4 Assessing the Results of PSI-BLAST and DELTA-BLAST Searches

The user should be aware that the primary concern for false
prediction of homology by PSI-BLAST is inclusion of a nonhomologous
sequence into the PSSM, which can be particularly problematic if
the profile is compositionally biased. For example, if a profile model
includes a protein domain common to several of the target
sequences but not shared with the query, then the model may be
incorrectly enriching for that domain. By seeding the PSSM model
with domains known to be related to the query sequence,
DELTA-BLAST reduces the likelihood of these compound errors but
instead can be subject to database bias if the query sequence
contains predominantly uncommon domains. Therefore, as with
standard BLAST searches, the user should exhibit caution when
interpreting the results [26, 41, 43, 44, 49]. In both cases, the
incorporation of a nonhomologous sequence can lead to the
identification and subsequent profile inclusion of sequences with high
similarity to the erroneous sequence rather than to the query
sequence. As with any BLAST search, the alignment should be
inspected carefully. Due to the iterative nature of these methods,
any sequence included in error usually leads to
an amplification of problems that may go unnoticed if the user is
not vigilant. The user should look for a similar conservation pattern
in each of the alignments. If a sequence seems to be included
erroneously, it can be excluded from the PSSM and subsequent
searches by unchecking the relevant box in the BLAST
output. If the sequence returns in later iterations, seek
corroboration of the finding by other means such as reciprocal searching
(see Note 7).

3.2.5 Additional Methods of Homolog Detection

There are several other available methods that employ profile or
HMM sequence comparison or combine multiple methods to
infer function. Interested users should investigate these as they
potentially offer greater sensitivity for detection of distant
homologs [2, 50].

CombFunc: (http://www.sbg.bio.ic.ac.uk/~mwass/combfunc/) [51].
The CombFunc Webserver employs multiple
approaches to determine function including BLAST-based
sequence similarity, protein fold prediction, gene ontology,
protein-protein interaction, and gene co-expression data. Support for
function is determined by combining these results using a support
vector machine learning approach.

FFPRED: (http://bioinf.cs.ucl.ac.uk/) [52]. The FFPRED 
server is a powerful tool that aims to assign gene ontology [53] 
biological process and molecular function terms to difficult to 
annotate sequences based on the characteristics of the searched 
amino acid sequence. The FFPRED server performed very well in 
this task during the recent critical assessment of protein function 
annotation (CAFA) experiment [52]. 

Blocks: (http://blocks.fhcrc.org/) [54, 55]. Blocks utilizes a 
database of ungapped multiple alignments that correspond to the 
highly conserved regions of proteins. Query sequences can be 
compared to the Blocks database via the block searcher tool, 
IMPALA (comparison of query to database of PSSMs) [56] and 
LAMA (comparison of a multiple alignment to Blocks using a
profile:profile method) [57].

COMPASS: (http://prodata.swmed.edu/compass/compass.php) [58-61].
COMPASS generates statistical comparisons of multiple
protein alignments via profile generation.

HH-pred: (http://toolkit.tuebingen.mpg.de/hhpred) [62, 
63]. HH-pred uses the HHsearch algorithm to search protein 
and domain databases by pairwise alignment of profile-HMMs. 
Alignment incorporates predicted structural information and can 
generate an HMM from a single submitted query sequence by 
automated PSI-BLAST search. 

Hmmer: (http://hmmer.janelia.org/) [27, 64]. Hmmer uses a 
Profile-HMM method of sequence comparison. Tools include: 
hmmbuild (HMM construction based on a multiple sequence 
alignment), hmmalign (align sequence(s) to an existing HMM), 
and hmmsearch (search a database of sequences for a significant 
match to an HMM). 

3.3 Analysis Pipeline Step 3: Genomic Sequence Comparison

Recent advances in sequencing and genome annotation have led to
the generation of datasets that can provide users with vast amounts
of precomputed and cataloged information. Linking of a query
sequence to a gene in these databases allows rapid access to
functional annotation including predicted orthologs and paralogs, gene
structure, gene expression, splice variation, association with disease,
chromosomal location, and gene polymorphism data. Thus,
inferring homology (either using keyword or similarity searches as
described previously) to a gene from a multigenome resource such as
those described in this section should be a final step in the analysis
pipeline. The annotation of a gene in this way may corroborate
findings determined during previous steps and may offer additional
data to reinforce a working hypothesis. These databases also have an
advantage in that they are regularly updated with improved gene
predictions and annotations. The resource used will depend on the
organism from which the query sequence was obtained (see Note 7).
Although some data overlap is inevitable, users are encouraged
to try each tool to survey the breadth of information available
(see Note 8).

For many eukaryotic organisms the UCSC genome browser 
[15] and Ensembl genome server [13, 14] are ideal sources of 
information. Both include a wealth of annotation data for the 
genomes of multiple organisms, and direct links between the two 
tools are provided. The contents of these databases are regularly 
updated and reflect the current trend for whole-genome sequencing
of biologically relevant model organisms and increasingly
organisms of interest for comparative evolutionary analysis [31]. 
Searching of these databases can be via a gene identifier or by a 
similarity query (BLAST at Ensembl and BLAT at UCSC genome 
browser) [65]. 

In addition, the COGs database housed at the NCBI contains 
clusters of orthologous genes (COGs) that are typically associated 
with a specific and defined function [7, 66]. Although primarily a 
tool for comparison of prokaryotes and unicellular eukaryotes, the 
COG database also includes many eukaryotic genomes [67]. Of 
particular use is the interlinking of the COG database with the 
other databases at the NCBI [38], allowing direct links from a 
BLAST hit to a predicted COG via the Blast Link (BLink) tool. 

3.4 Conclusion

The prediction of function from sequence data alone is a complex
procedure. Appropriate prior information, such as
the tissue or developmental stage from which sequences were
collected, should be added to working hypotheses as analysis is
conducted. It should also be remembered that predictive tools,
although based on robust algorithms, can sometimes produce
inconsistent or incorrect results. Therefore, the experimenter
should look for convergence between multiple methods to
improve confidence in prediction and seek experimental verification
where possible.


4 Notes 


1. In describing any predicted relationship it must be remembered
that the terms similarity and homology are not interchangeable.
Often sequences are described as sharing n percent similarity.
It is not, however, correct to use the term percent homologous;
genes either are homologous (implying an ancestral
relationship) or they are not [3].

2. When conducting domain analysis, note the positions of 
domains detected and conduct some background research of 
the essential residues and predicted function of these domains. 
Record the probability associated with any domain prediction 
and the database version searched. 

3. To specifically identify recent homologs or in-paralogs, search an
appropriate database with a shallow scoring matrix (BLOSUM80,
PAM20) that will have a shorter look-back time, thus
biasing toward more recent homologs.

4. Only use PSI-BLAST or DELTA-BLAST if you are attempting
to identify distant homologs of unusual sequences. If the query
contains an abundant domain known to be present in many different
protein types (e.g., zf-C2H2 zinc fingers, of which there are thousands
known within any given species), consider masking this region
before running a BLAST search to avoid detection of an excess
of hits that provide little additional predictive information. If a
representative structural sequence is available, comparison of a
query sequence to the Protein Data Bank (PDB; http://www.rcsb.org/pdb/)
can help in identifying structurally and functionally
conserved residues. Increasing the gap penalty may decrease the
significance of unrelated sequences, improving the signal-to-noise
ratio for a true hit, but at the cost of missing true homologs.

5. E-values depend on database size. Therefore, the choice of
database will affect the significance or interpretation
of the results obtained. For example, to identify an
ortholog in bacterial genomes, searching a database of only
bacterial sequences will reduce the search space and improve
the significance of an E-value for a given alignment. Because the
E-value is proportional to the size of the search space, the same
alignment score yields an E-value roughly ten times smaller in a
database one tenth the size. Conversely, searching the large numbers
of near identical sequences held in the nr database could
result in a true homolog of threshold significance being missed.
Alternatively, a significant hit close to the threshold when
searching a small database should be checked carefully and
corroborated by other methods to avoid false-positives.

6. The relevant E-value for a hit sequence is the value when it is first
identified, not at convergence or at completion of a set number
of iterations. This is because inclusion of a sequence refines the
PSSM for subsequent searches and will lead to greater
significance of that sequence in subsequent iterations.

7. If the organism from which the query sequence was obtained is
not currently available, compare to taxonomically related
organisms to build a working hypothesis of a similar function for the
query gene.

8. Why is there no significant hit when I BLAST the genome of the
same or closely related organism? Many methods of whole
genome sequencing utilize a process of fragmentation,
sequencing, and computational reassembly of genomic DNA (Whole
Genome Shotgun sequencing). Depending on the depth of
coverage and the heterozygosity of the genomic DNA, this
approach will result in varying degrees of incomplete,
noncontiguous sequence. Genes apparently missing from the genome
may be located in these gaps or in repetitive, hard-to-sequence
regions of the genome. An alternative possibility is that the gene
prediction tools used to annotate the genome
may not have predicted the query gene correctly.
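
The point made in Note 6 can be made concrete with a short Python sketch of an iterate-until-convergence loop that records, for every hit, the E-value and iteration at which it was first included. This is a schematic only: the `search` callable stands in for a real profile search and is hypothetical, as are the toy E-values, which loosely follow the chain of sequences A-F in Fig. 4.

```python
def iterate_to_convergence(query, search, threshold=0.005, max_iterations=10):
    """Repeat a profile search until no new hit passes the threshold.

    search(members) must return a dict mapping hit identifiers to E-values.
    Returns (first_seen, iterations_run, converged), where first_seen maps
    each included hit to (E-value, iteration) at its first inclusion.
    """
    first_seen = {}
    members = [query]
    for iteration in range(1, max_iterations + 1):
        hits = search(members)
        new_hits = {
            hit: e_value
            for hit, e_value in hits.items()
            if e_value <= threshold and hit != query and hit not in first_seen
        }
        if not new_hits:
            return first_seen, iteration, True
        for hit, e_value in new_hits.items():
            first_seen[hit] = (e_value, iteration)  # never overwritten later
        members = [query] + sorted(first_seen)
    return first_seen, max_iterations, False


# Hypothetical toy data: which sequences each profile member can detect.
REACH = {
    "A": {"B": 1e-10, "C": 1e-8},
    "C": {"D": 1e-4},
    "D": {"E": 1e-3, "F": 2e-3},
}


def toy_search(members):
    """Stand-in for a profile search: pool the best E-value per detectable hit."""
    hits = {}
    for member in members:
        for hit, e_value in REACH.get(member, {}).items():
            hits[hit] = min(e_value, hits.get(hit, float("inf")))
    return hits


first_seen, iterations_run, converged = iterate_to_convergence("A", toy_search)
```

Keeping `first_seen` separate from the latest search results is the design choice that matters here: later iterations may report a much smaller E-value for the same sequence, but that value reflects a profile the sequence itself helped to build.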


References 

1. Doolittle RF (1981) Similar amino acid sequences: chance or common ancestry? Science 214(4517):149-159

2. Pearson WR, Sierk ML (2005) The limits of protein sequence comparison? Curr Opin Struct Biol 15(3):254-260

3. Fitch WM (2000) Homology a personal view on some of the problems. Trends Genet 16(5):227-231

4. Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science 278(5338):609-614

5. Sonnhammer EL, Koonin EV (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18(12):619-620

6. Weber MJ (2005) New human and mouse microRNA genes found by homology search. FEBS J 272(1):59-73

7. Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28(1):33-36

8. Hurles M (2004) Gene duplication: the genomic trade in spare parts. PLoS Biol 2(7):E206

9. Bateman A (1997) The structure of a domain common to archaebacteria and the homocystinuria disease protein. Trends Biochem Sci 22(1):12-13

10. Ponting CP, Russell RR (2002) The natural history of protein domains. Annu Rev Biophys Biomol Struct 31:45-71

11. Ponting CP (2001) Issues in predicting protein function from sequence. Brief Bioinform 2(1):19-29

12. Ponting CP, Dickens NJ (2001) Genome cartography through domain annotation. Genome Biol 2(7), Comment 2006


13. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S et al (2013) Ensembl 2013. Nucleic Acids Res 41(Database issue):D48-D55

14. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al (2002) The Ensembl genome database project. Nucleic Acids Res 30(1):38-41

15. Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B et al (2013) The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res 41(Database issue):D64-D69

16. Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30(1):281-283

17. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Zhang D, Bryant SH (2013) CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 41(Database issue):D348-D352

18. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD et al (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 29(1):37-40

19. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9):1236-1240

20. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res 27(1):260-262

21. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL, Tate J, Punta M (2014) Pfam: the protein families database. Nucleic Acids Res 42(Database issue):D222-D230

22. Letunic I, Doerks T, Bork P (2012) SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res 40(Database issue):D302-D305

23. Schultz J, Milpetz F, Bork P, Ponting CP (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 95(11):5857-5864

24. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85(8):2444-2448

25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403-410

26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402

27. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14(9):755-763

28. Gibbs RA, Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G, Steffen D, Worley KC, Burch PE et al (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428(6982):493-521

29. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860-921

30. Ellsworth RE, Jamison DC, Touchman JW, Chissoe SL, Braden Maduro VV, Bouffard GG, Dietrich NL, Beckstrom-Sternberg SM, Iyer LM, Weintraub LA et al (2000) Comparative genomic sequence analysis of the human and mouse cystic fibrosis transmembrane conductance regulator genes. Proc Natl Acad Sci U S A 97(3):1172-1177

31. Emes RD, Goodstadt L, Winter EE, Ponting CP (2003) Comparison of the genomes of human and mouse lays the foundation of genome zoology. Hum Mol Genet 12(7):701-709

32. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 28(1):231-234

33. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26(1):320-322

34. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34(Database issue):D247-D251

35. Henikoff S, Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins 17(1):49-61

36. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22):10915-10919

37. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195-197

38. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S et al (2008) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 36(Database issue):D13-D21

39. Pearson WR (2014) BLAST and FASTA similarity searching for multiple sequence alignment. Methods Mol Biol 1079:75-101

40. Altschul SF, Gertz EM, Agarwala R, Schaffer AA, Yu YK (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37(3):815-824

41. Altschul SF, Koonin EV (1998) Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem Sci 23(11):444-447

42. Boratyn GM, Schaffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL (2012) Domain enhanced lookup time accelerated BLAST. Biol Direct 7:12

43. Jones DT, Swindells MB (2002) Getting the most from PSI-BLAST. Trends Biochem Sci 27(3):161-164

44. Korf I (2003) Serial BLAST searching. Bioinformatics 19(12):1492-1496

45. Altschul SF, Bundschuh R, Olsen R, Hwa T (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 29(2):351-361





46. Wootton JC, Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol 266:554-571

47. Altschul SF, Gish W (1996) Local alignment statistics. Methods Enzymol 266:460-480

48. Henikoff S (1996) Scores for sequence searches and alignments. Curr Opin Struct Biol 6(3):353-360

49. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29(14):2994-3005

50. Sierk ML, Pearson WR (2004) Sensitivity and selectivity in protein structure comparison. Protein Sci 13(3):773-785

51. Wass MN, Barton G, Sternberg MJ (2012) CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 40(Web Server issue):W466-W470

52. Minneci F, Piovesan D, Cozzetto D, Jones DT (2013) FFPred 2.0: improved homology-independent prediction of gene ontology terms for eukaryotic protein sequences. PLoS One 8(5):e63754

53. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25-29

54. Henikoff S, Pietrokovski S, Henikoff JG (1998) Superior performance in protein homology detection with the Blocks Database servers. Nucleic Acids Res 26(1):309-312

55. Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S (2000) Blocks-based methods for detecting protein homology. Electrophoresis 21(9):1700-1706

56. Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15(12):1000-1011


57. Pietrokovski S (1996) Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res 24(19):3836-3845

58. Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326(1):317-336

59. Sadreyev RI, Grishin NV (2004) Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics 20(6):818-828

60. Sadreyev RI, Tang M, Kim BH, Grishin NV (2007) COMPASS server for remote homology inference. Nucleic Acids Res 35(Web Server issue):W653-W658

61. Sadreyev RI, Tang M, Kim BH, Grishin NV (2009) COMPASS server for homology detection: improved statistical accuracy, speed and functionality. Nucleic Acids Res 37(Web Server issue):W90-W94

62. Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33(Web Server issue):W244-W248

63. Hildebrand A, Remmert M, Biegert A, Söding J (2009) Fast and accurate automatic structure prediction with HHpred. Proteins 77(Suppl 9):128-132

64. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7(10):e1002195

65. Kent WJ (2002) BLAT—the BLAST-like alignment tool. Genome Res 12(4):656-664

66. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD et al (2007) CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res 35(Database issue):D237-D240

67. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res 31(1):28-33



Chapter 3 


Inferring Functional Relationships from Conservation 
of Gene Order 

Gabriel Moreno-Hagelsieb 

Abstract 

Predicting functional associations using the Gene Neighbor Method depends on the simple idea that if 
genes are conserved next to each other in evolutionarily distant prokaryotes they might belong to a 
polycistronic transcription unit. The procedure presented in this chapter starts with the organization of 
the genes within genomes into pairs of adjacent genes. Then, the pairs of adjacent genes in a genome of 
interest are mapped to their corresponding orthologs in other, informative, genomes. The final step is to 
verify if the mapped orthologs are also pairs of adjacent genes in the informative genomes. 

Key words Conservation of gene order, Operon, Genomic context, Functional inference, Gene 
neighbor method 


1 Introduction 


Two independent works first presented data supporting the idea
that genes conserved together in evolutionarily distant genomes
might have functional associations [1, 2]. However, the first
thoroughly described method to infer functional relationships using
conservation of gene order might be the work published by
Overbeek et al. [3]. Nowadays, the method is part of the RAST/MG-RAST
system of functional annotations [4, 5]. The method finds
support in three main ideas (Fig. 1): (1) the knowledge that genes
in operons, stretches of adjacent genes in the same DNA strand
transcribed into a single mRNA [6, 7], are functionally related; (2)
the expectation that operons should be conserved throughout
evolution; and (3) the finding that gene order in general is not
conserved [8], and is lost much faster than protein sequence
identity [9]. Thus, conservation of gene order at long evolutionary
distances might indicate a functional relationship. Some divergently
transcribed genes (Fig. 1) might also be functionally related (see, for
instance [10-12]). However, Overbeek et al. [3] found that the
conservation of divergently transcribed genes in evolutionarily
distant genomes was minimal compared to the conservation of
genes in the same strand. Thus, they limited their analyses to the
detection of conservation of adjacency of genes in the same DNA
strand. Given that the significance of conservation of adjacency
increases with the phylogenetic distance of the genomes compared,
Overbeek et al. [3] directly used phylogenetic distances as a score.
However, selecting an appropriate threshold was a problem.
Ermolaeva et al. [13] proposed the conservation of adjacency of genes in
opposite strands as a representative of background conservation
useful to calculate a confidence value for the conservation of
adjacent genes in the same strand. An approach using a simplified
method similar to that presented by Ermolaeva et al. [13] was
used to show that conservation of adjacency of paralogous genes
is also useful for predicting operons and functional relationships of
gene products [14].

Fig. 1 Pairs of genes used in the study of conservation of gene order. Genes are
represented by arrows indicating the direction of transcription. Same-strand
genes would be the ones that might be in operons, and thus functionally
related. Genes in opposite strands can be either divergently transcribed or
convergently transcribed. Comparing the conservation of gene order of genes
in the same strand against that of genes in opposite strands helps calculate a
confidence value for predictions of functional interactions
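
The three kinds of adjacent gene pairs distinguished in Fig. 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the chapter's own code: genes are given as hypothetical (name, start, strand) tuples on a single replicon, strands are written "+" and "-", and the wrap-around pair of a circular chromosome is ignored.

```python
def classify_pairs(genes):
    """Classify each pair of adjacent genes by relative orientation.

    genes is a list of (name, start, strand) tuples. Returns a list of
    (upstream_gene, downstream_gene, kind) in coordinate order, where kind
    is "same-strand", "divergent" (<- ->), or "convergent" (-> <-).
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    pairs = []
    for (name_a, _, strand_a), (name_b, _, strand_b) in zip(ordered, ordered[1:]):
        if strand_a == strand_b:
            kind = "same-strand"
        elif strand_a == "-" and strand_b == "+":
            kind = "divergent"
        else:
            kind = "convergent"
        pairs.append((name_a, name_b, kind))
    return pairs


# Hypothetical four-gene region.
example_genes = [("g1", 100, "+"), ("g2", 900, "+"), ("g3", 1800, "-"), ("g4", 2500, "+")]
example_pairs = classify_pairs(example_genes)
```

Only the same-strand pairs are candidates for operons; the opposite-strand pairs serve as the background against which conservation is judged.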

Another approach to conservation of gene order is to count the 
number of genomes in which a given pair of genes are conserved 
next to each other (see, for instance [15,16]). The main problem of 
such an approach is that conservation of adjacency in very closely 
related genomes is not as informative as that among evolutionarily 
distant genomes. A later approach uses phylogenetic relationships 
to correct for this problem [17]. I present here the simplified 
method that uses adjacent genes in opposite strands for calibration 
mentioned earlier [14]. The confidence values obtained in this 
method provide a direct measure of significance that is very easy 
to understand. Moreover, there are no results yet showing that 
accounting for the number of genomes in which the genes are 
conserved produces any better results than conservation in evolu¬ 
tionarily distant genomes. 
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2 Systems, Software, and Databases 


2.1 A UNIX-Based 
Operative System 


2.2 The RefSeq 
Bacterial Genome 
Database 


I am assuming that the user is working with a UNIX-based 
operating system. The prevalent UNIX-based systems today 
include Mac OSX, and many other systems based on Linux, like 
Ubuntu and Debian. Specialized computer system servers and 
workstations tend to also run under a UNIX-based operative 
system. 

The RefSeq database [18, 19] contains files with diverse information
about each genome. The database can be downloaded using
programs such as “wget” or “rsync” (see Note 1). For instance,
periodically running the command:

    rsync -av rsync://rsync.ncbi.nlm.nih.gov/genomes/refseq/Bacteria/ \
        LOCAL_GENOMES --delete

would keep an updated directory “LOCAL_GENOMES” with all
the information in the directory “/genomes/refseq/Bacteria” of
the NCBI rsync server (see Note 2). Here I will be using three files
under each genome directory, those ending with “.gbk,” with “.ptt”
and with “.rnt” (from now on called GBK, PTT, and RNT files).
Though the GBK file generally contains all of the necessary
information, the PTT and RNT files are more programmer friendly.
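
To illustrate why the PTT and RNT tables are convenient to program against, here is a minimal sketch that splits one tab-separated data line into named fields. The subroutine name `parse_ptt_line` and the hash keys are illustrative assumptions, not part of the programs distributed with this chapter; the column order follows the PTT layout shown in Fig. 2.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): split one
# tab-separated PTT data line into a hash of named fields.
# Header lines lack a "+" or "-" strand field and are rejected.
sub parse_ptt_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    return unless @f >= 9 && $f[1] =~ /^[+-]$/;
    my ($start, $end) = $f[0] =~ /^(\d+)\.\.(\d+)$/ or return;
    return {
        start   => $start,
        end     => $end,
        strand  => $f[1],
        length  => $f[2],
        pid     => $f[3],
        gene    => $f[4],
        synonym => $f[5],
        code    => $f[6],
        cog     => $f[7],
        product => $f[8],
    };
}

# One data line taken from the E. coli K12 table in Fig. 2
my $example = "337..2799\t+\t820\t16127996\tthrA\tb0002\tE\tCOG0460\t"
            . "bifunctional aspartokinase I/homoserine dehydrogenase I";
my $gene = parse_ptt_line($example);
print join("\t", @{$gene}{qw(pid gene strand start)}), "\n";
```

Extracting the same fields from a GBK file would require parsing multi-line feature blocks, which is why the tables are the easier starting point.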


2.3 NCBI’s BLAST+

To map the corresponding orthologous [homologous] genes it is
necessary to compare proteins encoded by the genes within all
genomes. Identifying orthologs is important for any genome context
method for inferring functional associations [9]. Appropriate
binaries of NCBI’s BLAST+ [20] program suite can be downloaded
from NCBI’s servers using rsync at:
rsync://rsync.ncbi.nlm.nih.gov/blast/executables/LATEST/.


3 Methods 


The method starts with the construction of files or databases of 
gene neighbors. For each gene in a given pair of neighbors in the 
genome of interest, the method verifies the existence of orthologs 
in an informative genome. If the orthologs exist, their adjacency is 
investigated. The following pseudocode summarizes this method: 

GENE_NEIGHBOR_METHOD
    for each informative_genome
        count_conserved <- 0
        conserved_list <- ""
        for each NEIGHBORS(a,b) in genome_of_interest
            if (ORTH(a) AND ORTH(b)) in informative_genome
                if (NEIGHBORS(ORTH(a),ORTH(b))) in informative_genome
                    ADD(a,b) to conserved_list
                    count_conserved <- count_conserved + 1
        return (informative_genome, count_conserved, conserved_list)

Notice that the results are returned for each informative
genome. This is important in order to calculate confidence scores.
Throughout the detailed method that follows, I will be using PERL
code to exemplify each step. The programs and example files can also
be downloaded from http://microbiome.wlu.ca/GeneNeighbor/.

3.1 Learn Orthologs

Orthologs are defined as genes that diverge after a speciation event
[21]. Such genes can also be colloquially referred to as the “same
genes” in different species. Accordingly, orthologs are the appropriate
genes to compare in the Gene Neighbor method. In comparative
genomics, the most commonly used working definition of
orthology is reciprocal best hits [22-24]. Two genes in two different
genomes are reciprocal best hits if, when each is used as a query,
each finds the other as its top scoring hit (see Note 3).
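
The reciprocal best hit definition can be sketched in a few lines. This is a minimal illustration, assuming that the best hits in each direction have already been collected into hashes with a single top-scoring hit per query; the subroutine name `reciprocal_best_hits` and the toy identifiers are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): given best hits in
# both directions, keep only the reciprocal pairs. Each hash maps a query
# gene to ONE top-scoring hit; ties are ignored here for simplicity.
sub reciprocal_best_hits {
    my ($a_to_b, $b_to_a) = @_;
    my %rbh;
    for my $gene_a (keys %$a_to_b) {
        my $gene_b = $a_to_b->{$gene_a};
        next unless exists $b_to_a->{$gene_b};
        $rbh{$gene_a} = $gene_b if $b_to_a->{$gene_b} eq $gene_a;
    }
    return %rbh;
}

# Toy identifiers, invented for illustration only
my %a_to_b = (a1 => 'b1', a2 => 'b2', a3 => 'b2');
my %b_to_a = (b1 => 'a1', b2 => 'a3');
my %rbh = reciprocal_best_hits(\%a_to_b, \%b_to_a);
print "$_\t$rbh{$_}\n" for sort keys %rbh;
```

In this toy case a2 is dropped: its best hit is b2, but the best hit of b2 is a3, so the relationship is not reciprocal.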

The possibility of adjacently conserved paralogs, genes that
diverge after duplication events [21], was also discussed by
Overbeek et al. [3]. Moreover, other work has shown that operons have
a tendency toward producing paralog operons [14, 25], and that
strict detection of orthologs is not necessary for prediction of
functional association [14]. Thus, here I use conservation of
unidirectional best hits for predicting interactions by conservation of
gene order. In order to detect the best hits for genes in a target
genome, the protein sequences encoded by the genes in the target
genome are compared against those encoded by the informative
genome using BLASTP (see Note 4):

    blastp -query genome_of_interest -db informative_genome -evalue 1e-4 \
        -seg yes -soft_masking true -use_sw_tback -outfmt 6 -out \
        genome_of_interest.informative_genome.blastp

This will produce a table of BLAST hits between the genome of
interest and a given informative genome in the file
“genome_of_interest.informative_genome.blastp.” The ‘-outfmt 6’ option instructs
BLASTP to format the results into a simple, tab-separated table.
The other options are ‘-evalue 1e-4,’ which sets the maximum
E-value to 1e-4; ‘-seg yes -soft_masking true,’ which filters
low-information sequences during the BLAST search, but not during
the alignment; and ‘-use_sw_tback,’ which requests a Smith-Waterman
alignment to calculate the scores [26] (see Note 5).
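
As a minimal sketch of what the tab-separated table contains, the following subroutine maps one line of ‘-outfmt 6’ output onto the twelve default column names of BLAST+ tabular output. The subroutine name `parse_blast_line` and the example hit line are assumptions made for illustration.

```perl
use strict;
use warnings;

# The twelve default columns of BLAST+ tabular output (-outfmt 6)
my @COLUMNS = qw(qseqid sseqid pident length mismatch gapopen
                 qstart qend sstart send evalue bitscore);

# Hypothetical helper (an assumption for illustration): turn one tabular
# line into a hash keyed by column name; other lines are rejected.
sub parse_blast_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    return unless @values == @COLUMNS;
    my %hit;
    @hit{@COLUMNS} = @values;
    return \%hit;
}

# An invented hit line, for illustration only
my $line = join "\t", 'gi|111|ref|QUERY_1|', 'gi|222|ref|TARGET_1|',
                      '98.05', 820, 16, 0, 1, 820, 1, 820, '0.0', 1634;
my $hit = parse_blast_line($line);
printf "%s -> %s (E = %s, bits = %s)\n",
    @{$hit}{qw(qseqid sseqid evalue bitscore)};
```

The E-value and the bit score are the last two columns, which is why a program can also read them by position from the end of each line.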

BLAST presents results sorted from best to worst match. Thus, 
a subroutine in PERL that can get the best hits would look like this 
(see Note 6): 

    sub get_best_hits {
        my ($genome_of_interest, $informative_genome) = @_;
        my %best_hits = ();
        my %E_value   = ();
        my %bit_score = ();
        my $blast_file
            = "BLAST_RUNS/$genome_of_interest.$informative_genome.blastp";
        open(BLTBL, $blast_file) or die "\tCannot open $blast_file\n\n";
        while (<BLTBL>) {
            # skip comment lines and empty lines
            next if (/^#/ || /^\s*$/);
            my ($query_id, $target_id, @stats) = split;
            # both the query and target have a complex name,
            # we only need the gi number to match the neighbor
            # table identifiers
            my ($query)  = $query_id  =~ /gi\|(\d+)/;
            my ($target) = $target_id =~ /gi\|(\d+)/;
            next unless (defined($query) && defined($target));
            # the penultimate value is the E value
            my $E_value = $stats[$#stats - 1];
            # the last value is the bit score
            my $bit_score = $stats[$#stats];
            # now we actually learn the best hits
            if (exists($bit_score{$query})) {
                # later hits are kept only if they tie with the best one
                if (
                    ($E_value{$query} == $E_value)
                    && ($bit_score{$query} == $bit_score)
                ) {
                    $best_hits{$query} .= "," . $target;
                }
            }
            else {
                # the first hit listed for a query is its best hit
                $E_value{$query}   = $E_value;
                $bit_score{$query} = $bit_score;
                $best_hits{$query} = $target;
            }
        }
        close(BLTBL);
        return (%best_hits);
    }

3.2 Neighbors Database

The natural next step is to build a database of gene neighbors. The
minimum information that this database should contain is a list of
adjacent gene pairs and information on the strand on which each
gene is found. To build this database, a convenient starting point is
the RefSeq genomes database, available from the NCBI ftp server.

Several RefSeq files could be used to obtain coordinates for each
gene within the genome. Here I exemplify with the PTT (protein
table) and RNT (ribonucleotide table) files. The PTT file contains a
table of protein-coding genes, while the RNT file contains a table of
rRNA and tRNA genes (see Note 4). The first column within these
tables consists of the gene coordinates. As an example I show a few
lines of the PTT file for the genome of Escherichia coli K12 [27],
accession “NC_000913,” version “NC_000913.2 GI:49175990”
(Fig. 2).

Escherichia coli K12, complete genome - 0..4639675
4237 proteins
Location       Strand Length PID      Gene   Synonym Code COG     Product
190..255       +      21     16127995 thrL   b0001   -    -       thr operon leader peptide
337..2799      +      820    16127996 thrA   b0002   E    COG0460 bifunctional aspartokinase I/homoserine dehydrogenase I
2801..3733     +      310    16127997 thrB   b0003   E    COG0083 homoserine kinase
3734..5020     +      428    16127998 thrC   b0004   E    COG0498 threonine synthase
5234..5530     +      98     16127999 yaaX   b0005   -    -       hypothetical protein
5683..6459     -      258    16128000 yaaA   b0006   S    COG3022 hypothetical protein
6529..7959     -      476    16128001 yaaJ   b0007   E    COG1115 inner membrane transport protein
8238..9191     +      317    16128002 talB   b0008   G    COG0176 transaldolase
9306..9893     +      195    16128003 mogA   b0009   H    COG0521 molybdenum cofactor biosynthesis protein
9928..10494    -      188    16128004 yaaH   b0010   S    COG1584 putative regulator, integral membrane protein
10643..11356   -      237    16128005 yaaW   b0011   S    COG4735 hypothetical protein
10725..11315   +      196    16128006 htgA   b0012   -    -       positive regulator for sigma 32 heat shock promoters
11382..11786   -      134    16128007 yaaI   b0013   -    -       hypothetical protein
12163..14079   +      638    16128008 dnaK   b0014   O    COG0443 molecular chaperone DnaK
14168..15298   +      376    16128009 dnaJ   b0015   O    COG0484 chaperone with DnaK; heat shock protein
15445..16557   +      370    16128010 yi81_1 b0016   L    COG3385 IS186 hypothetical protein
15869..16177   -      102    16128011 yi82_1 b0017   -    -       IS186 and IS421 hypothetical protein
16751..16960   -      69     16128012 mokC   b0018   -    -       regulatory peptide whose translation enables hokC (gef) expression
16751..16903   -      50     49175991 hokC   b4412   -    -       small toxic membrane polypeptide
17489..18655   +      388    16128013 nhaA   b0019   P    COG3004 Na+/H antiporter, pH dependent
18715..19620   +      301    16128014 nhaR   b0020   K    COG0583 transcriptional activator of cation transport (LysR family)
19811..20314   -      167    16128015 insB_1 b0021   L    COG1662 IS1 protein InsB
20233..20508   -      91     16128016 insA_1 b0022   L    COG3677 IS1 protein InsA
20815..21078   -      87     16128017 rpsT   b0023   J    COG0268 30S ribosomal protein S20
21181..21399   +      72     16128018 yaaY   b0024   -    -       unknown CDS
21407..22348   +      313    16128019 ribF   b0025   H    COG0196 hypothetical protein
22391..25207   +      938    16128020 ileS   b0026   J    COG0060 isoleucyl-tRNA synthetase
25207..25701   +      164    16128021 lspA   b0027   M    COG0597 signal peptidase II
25826..26275   +      149    16128022 fkpB   b0028   O    COG1047 FKBP-type peptidyl-prolyl cis-trans isomerase (rotamase)
26277..27227   +      316    16128023 ispH   b0029   I    COG0761 4-hydroxy-3-methylbut-2-enyl diphosphate reductase
27293..28207   +      304    16128024 rihC   b0030   F    COG1957 nucleoside hydrolase
28374..29195   +      273    16128025 dapB   b0031   E    COG0289 dihydrodipicolinate reductase
29651..30799   +      382    16128026 carA   b0032   E    COG0505 carbamoyl-phosphate synthase small subunit
30817..34038   +      1073   16128027 carB   b0033   E    COG0458 carbamoyl-phosphate synthase large subunit
34195..34695   +      166    49175992 caiF   b0034   -    -       transcriptional regulator of cai operon
34781..35392   -      203    16128029 caiE   b0035   R    COG0663 possible synthesis of cofactor for carnitine racemase and dehydratase
35377..36270   -      297    16128030 caiD   b0036   I    COG1024 carnitinyl-CoA dehydratase
36271..37839   -      522    49175993 caiC   b0037   -    -       crotonobetaine/carnitine-CoA ligase

Fig. 2 A few lines of the PTT table of the genome of Escherichia coli K12. The first column of the PTT
(protein-coding genes) and of the RNT (noncoding genes, those producing rRNAs and tRNAs) tables contains the gene
coordinates. The second column contains the strand where the gene is found, which is useful for organizing
the genes into stretches of adjacent genes in the same strand, called directons. The fourth column is the GI
number (labeled here as a PID or protein identifier). This number is the best identifier for the
protein-coding genes in a genome because it is unique. However, in the RNT tables this column is not the best
identifier; the best identifiers seem to be the gene name (fifth column), and the synonym (sixth column). The
table in the figure is formatted for display purposes, but the original PTT and RNT tables contain tab-separated
plain text

The first column in these tables corresponds to gene coordinates.
Thus, the problem of forming pairs of adjacent genes
becomes trivial. All that is needed is to sort the genes and associate
each of them with the next gene in the list, formatting them into a
table, or a database, of Gene Neighbors. The header of the resulting
table might look like this:

    Gene_a    Gene_b    Strands

Genes in the same strand will have either “++” or “--” in the
“Strands” column, while genes in different strands will have either
“+-” (convergently transcribed) or “-+” (divergently transcribed)
in this field (see Note 7).

If the genome is circular, a final pair should be formed with
the last and the first genes in the table. The first line in the GBK
file indicates whether the replicons are circular or linear (see
Note 8).

An example program in PERL that will output this table is:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $die_msg = "\tI need a genome to work with\n\n";
    my $genome_of_interest = $ARGV[0] or die $die_msg;
    $die_msg = "\tNo $genome_of_interest directory\n\n";
    my $genome_dir = "LOCAL_GENOMES/$genome_of_interest";
    opendir(GNMDIR, "$genome_dir") or die $die_msg;
    my @ptt_files = grep { /\.ptt$/ } readdir(GNMDIR);
    closedir(GNMDIR);
    my $results_dir = "NEIGHBORS";
    mkdir($results_dir) unless (-d $results_dir);
    open(NGHTBL, ">", "$results_dir/$genome_of_interest.nghtbl")
        or die "\tCannot write into $results_dir\n\n";
    for my $ptt_file (@ptt_files) {
        # get proper name of the RNT and GBK files
        my $rnt_file = $ptt_file;
        my $gbk_file = $ptt_file;
        $rnt_file =~ s/\.ptt$/.rnt/;
        $gbk_file =~ s/\.ptt$/.gbk/;
        # Is the genome circular?
        # The information is in the first line of the GBK
        # file, which starts with the word "LOCUS"
        my $circular = "yes"; # make circular the default
        if (open(GBK, "$genome_dir/$gbk_file")) {
            while (<GBK>) {
                if (/^LOCUS/) {
                    $circular = "no" if (/linear/i);
                    last; # we do not need to read any further
                }
            }
            close(GBK);
        }
        # now we read the table of protein coding genes
        # and their "leftmost" coordinate so we can
        # order them and find the neighbors
        my %strand = ();
        my %coord  = ();
        open(PTT, "$genome_dir/$ptt_file")
            or die "\tCannot open $ptt_file\n\n";
        while (<PTT>) {
            my @data = split;
            # data lines have "+" or "-" in the second column
            next unless (defined($data[1]) && $data[1] =~ /^[+-]$/);
            my $gi = $data[3];
            $strand{$gi} = $data[1];
            my ($coord) = $data[0] =~ /^(\d+)/;
            $coord{$gi} = $coord;
        }
        close(PTT);
        # we verify that there is a table of rRNA and tRNA genes
        # if so, we get the genes
        if (-f "$genome_dir/$rnt_file") {
            open(RNT, "$genome_dir/$rnt_file")
                or die "\tCannot open $rnt_file\n\n";
            while (<RNT>) {
                my @data = split;
                next unless (defined($data[1]) && $data[1] =~ /^[+-]$/);
                # The identifier is not a GI
                # but I rather keep the variable names consistent
                # the best identifier for an 'RNA' gene is
                # the gene name (5th column)
                my $gi = $data[4];
                $strand{$gi} = $data[1];
                my ($coord) = $data[0] =~ /^(\d+)/;
                $coord{$gi} = $coord;
            }
            close(RNT);
        }
        # now we build the table of direct neighbors
        my @ids = sort { $coord{$a} <=> $coord{$b} } keys %coord;
        for my $i (0 .. $#ids) {
            if ($i < $#ids) {
                my $str = $strand{$ids[$i]} . $strand{$ids[$i + 1]};
                print NGHTBL $ids[$i], "\t", $ids[$i + 1], "\t", $str, "\n";
            }
            elsif ($circular eq "yes") {
                # close the circle: pair the last gene with the first
                my $str = $strand{$ids[$i]} . $strand{$ids[0]};
                print NGHTBL $ids[$i], "\t", $ids[0], "\t", $str, "\n";
            }
        }
    }

    close(NGHTBL);

and a subroutine that will read this table, learn the neighbors, and 
classify them as same-strand and opposite-strand neighbors is: 

    sub get_strands_of_neighbors {
        my $genome = $_[0];
        # we will learn the neighbors as hashes where the keys
        # are the neighbor pairs of genes and the values are
        # the strand situations (same strand or opposite strand)
        my %strands_of = ();
        open(NGH, "NEIGHBORS/$genome.nghtbl")
            or die "\tNo neighbors table for $genome\n\n";
        while (<NGH>) {
            my ($gi, $gj, $strand) = split;
            my $neighbors = join(",", sort($gi, $gj));
            if (($strand eq "--") || ($strand eq "++")) {
                $strands_of{"$neighbors"} = "same";
            }
            elsif (($strand eq "-+") || ($strand eq "+-")) {
                $strands_of{"$neighbors"} = "opp";
            }
        }
        close(NGH);
        return (%strands_of);
    }
3.3 Putting Everything Together

As originally defined, the Gene Neighbor Method aims to find
genes with conserved adjacency in evolutionarily distant genomes.
However, Ermolaeva et al. [13] have obviated the need for a
phylogenetic distance by using the genomes themselves to determine
the significance of the conservation in the form of a confidence
value. The idea behind the confidence value is that the
proportion of conserved adjacencies in opposite strands represents
conservation due to chance alone, or more properly, conservation
due to short evolutionary distance and chance rearrangement (see
Note 9). A simplified version of the confidence value calculated
under the same assumption is:

$$C = 1 - 0.5\,\frac{P_{\mathrm{Opp}}}{P_{\mathrm{Same}}}$$

The confidence value ($C$) can be thought of as a positive predictive
value (true positives divided by the total number of predictions) for
two genes to be conserved due to a functional interaction (they
would be in the same operon) (see Note 10). The value 0.5 in this
expression is a prior probability for the genes to be in different
transcription units. $P_{\mathrm{Opp}}$ is the count of pairs of orthologs
conserved next to each other in opposite strands (“+-” and “-+” pairs
of neighbor genes) divided by the total number of neighbors in
opposite strands in the informative genome. $P_{\mathrm{Same}}$ is the count of
orthologs conserved next to each other in the same strand (“++”
and “--”) divided by the total number of neighbors in the same
strand in the informative genome.
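
A worked instance of the confidence value may help. The sketch below assumes the counts have already been obtained; the subroutine name `confidence` and the counts themselves are invented for illustration, and the guard against empty categories is an addition of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): confidence value
# C = 1 - 0.5 * P_Opp / P_Same.
# Returns undef when a proportion cannot be computed.
sub confidence {
    my ($count_same, $total_same, $count_opp, $total_opp) = @_;
    return unless $total_same && $total_opp && $count_same;
    my $p_same = $count_same / $total_same;
    my $p_opp  = $count_opp  / $total_opp;
    return 1 - 0.5 * $p_opp / $p_same;
}

# Invented counts: P_Same = 400/2000 = 0.20 and P_Opp = 20/1000 = 0.02,
# so C = 1 - 0.5 * 0.02 / 0.20 = 0.95
my $c = confidence(400, 2000, 20, 1000);
printf "C = %.2f\n", $c;
```

The value approaches 1 as conservation in opposite strands (the calibration set) becomes rare relative to conservation in the same strand.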

Now, with all the necessary data (neighbors and best hits), and
with a way of calculating a confidence value, the previous
pseudocode is modified as:

GENE_NEIGHBOR_METHOD
    for each informative_genome
        count_same <- 0
        count_opposite <- 0
        conserved_same-strand <- ""
        conserved_opposite-strand <- ""
        for each NEIGHBORS(a,b) in genome_of_interest
            if (ORTH(a) AND ORTH(b)) in informative_genome
                if (same-strand(ORTH(a),ORTH(b))) in informative_genome
                    ADD(a,b) to conserved_same-strand
                    count_same <- count_same + 1
                else if (opposite-strand(ORTH(a),ORTH(b))) in informative_genome
                    ADD(a,b) to conserved_opposite-strand
                    count_opposite <- count_opposite + 1
        confidence <- 1 - 0.5 * proportion(opposite) / proportion(same)
        return (informative_genome, confidence, conserved_same-strand)

And a particular example program in PERL would be: 

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $genome_of_interest = "Escherichia_coli_K12";
    my @genomes = qw(
        Salmonella_typhi_Ty2
        Yersinia_pestis_KIM
        Rhizobium_etli_CFN_42
        Bacillus_subtilis
    );
    my $results_dir = "Confidence";
    mkdir($results_dir) unless (-d $results_dir);
    my %strands_of = get_strands_of_neighbors($genome_of_interest);
    open(CONF, ">", "$results_dir/$genome_of_interest.confidence")
        or die "\tCannot write into $results_dir\n\n";
    for my $informative_genome (@genomes) {
        print $informative_genome, "\n";
        my %best_hits
            = get_best_hits($genome_of_interest, $informative_genome);
        my %inf_strands_of
            = get_strands_of_neighbors($informative_genome);
        my $count_same = 0;
        my $count_opp  = 0;
        my @predictions;
        for my $neighbors (keys %strands_of) {
            my ($gi, $gj) = split(/,/, $neighbors);
            # first see if there are any orthologs
            if (exists($best_hits{$gi})
                && exists($best_hits{$gj})) {
                # since there might be more than one ortholog, and
                # there might be more than one conserved pair,
                # we use a "flag" (count_conserv = "none") to
                # avoid "overcounting"
                my $count_conserv = "none";

                # now the actual verification of conservation
                for my $orth_i (split(/,/, $best_hits{$gi})) {
                    for my $orth_j (split(/,/, $best_hits{$gj})) {
                        my $test_neigh = join(",", sort($orth_i, $orth_j));
                        if (exists($inf_strands_of{$test_neigh})
                            && ($inf_strands_of{$test_neigh}
                                eq $strands_of{$neighbors})) {
                            $count_conserv = $strands_of{$neighbors};
                        }
                    }
                }

                # now we verify the flag and count any conservation
                if ($count_conserv eq "same") {
                    $count_same++;
                    push(@predictions, $neighbors);
                }
                elsif ($count_conserv eq "opp") {
                    $count_opp++;
                }
            }
        }
        # now we also need to count the number of genes in the same
        # strand and those in opposite strands in the informative genome
        my $total_same = 0;
        my $total_opp  = 0;
        for my $inf_ngh (keys %inf_strands_of) {
            if ($inf_strands_of{$inf_ngh} eq "same") {
                $total_same++;
            }
            elsif ($inf_strands_of{$inf_ngh} eq "opp") {
                $total_opp++;
            }
        }

        # the confidence value is undefined without conserved
        # same-strand pairs or without neighbors of either kind
        unless ($total_same && $total_opp && $count_same) {
            print "CONFIDENCE = NA\n";
            next;
        }
        # now we can calculate the confidence value
        my $P_same = $count_same / $total_same;
        my $P_opp  = $count_opp / $total_opp;
        my $conf = 1 - 0.5 * ($P_opp / $P_same);
        $conf = sprintf("%.2f", $conf);
        print "CONFIDENCE = ", $conf, "\n";
        # now print predictions with their confidence values
        for my $prediction (@predictions) {
            $prediction =~ s/,/\t/;
            print CONF $prediction, "\t", $conf, "\t",
                $informative_genome, "\n";
        }
    }
    close(CONF);

    # the subroutines get_best_hits and get_strands_of_neighbors
    # shown earlier must be included in this same file

When run, this program creates a single file with conserved 
neighbors, their confidence values, and the genome from which the 
value was obtained. At the same time, the program prints the 
following output to the display: 

    % ./neighbor-method.pl
    Salmonella_typhi_Ty2
    CONFIDENCE = 0.55
    Yersinia_pestis_KIM
    CONFIDENCE = 0.73
    Rhizobium_etli_CFN_42
    CONFIDENCE = 0.99
    Bacillus_subtilis
    CONFIDENCE = 1.00

The informative genomes are ordered evolutionarily from closest
to farthest. As expected, the evolutionarily closest organism to
E. coli K12 in this example, Salmonella typhi Ty2, gives the lowest
confidence value, while the farthest gives the maximum confidence
value. The threshold I use to accept predictions is a confidence
value > 0.95.
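
Applying the threshold to the results file can be sketched as follows. The subroutine name `accepted_predictions` and the example lines are invented for illustration. Keeping the highest confidence observed for a pair across informative genomes is an assumption of this sketch, not a rule stated by the method.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): from lines of
# "gene_a<TAB>gene_b<TAB>confidence<TAB>genome", keep for every gene pair
# the highest confidence seen, and report only the pairs whose best
# confidence exceeds the threshold.
sub accepted_predictions {
    my ($threshold, @lines) = @_;
    my %best;
    for my $line (@lines) {
        chomp $line;
        my ($gene_a, $gene_b, $conf, $genome) = split /\t/, $line;
        next unless defined $genome;
        my $pair = "$gene_a\t$gene_b";
        if (!exists $best{$pair} || $conf > $best{$pair}{conf}) {
            $best{$pair} = { conf => $conf, genome => $genome };
        }
    }
    return map  { "$_\t$best{$_}{conf}\t$best{$_}{genome}" }
           grep { $best{$_}{conf} > $threshold }
           sort keys %best;
}

# Invented identifiers and values, for illustration only
my @lines = (
    "g1\tg2\t0.55\tSalmonella_typhi_Ty2",
    "g1\tg2\t0.99\tRhizobium_etli_CFN_42",
    "g3\tg4\t0.73\tYersinia_pestis_KIM",
);
print "$_\n" for accepted_predictions(0.95, @lines);
```

Here only the pair g1/g2 is accepted, supported by the more distant informative genome.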


4 Notes 


1. Traditionally, UNIX users might have used the ftp program to
transfer files. The newer programs, wget and rsync, offer
options that help to transfer more than one file with a
single command. Many servers lack rsync capabilities, in which case
wget can be used. It is still possible to use the ftp command for
this task. Conveniently, ftp sites can be displayed in a web
browser, where the user can find the files of
interest and then download them.

2. As of this writing, NCBI has reorganized its RefSeq Genome
data server. The Bacteria subdirectory has been deleted. The
new directory structure is very complicated, which makes me
think that people wanting to work with all the complete genomes
will have to access them using a program. Some of the
changes at NCBI’s server allow access to several assemblies for
each genome, thus complicating the automatic decision as to
which assembly to download. I can advise little more now than
consulting the assembly file in order to decide what to
download:

    rsync -avzL \
        rsync://rsync.ncbi.nlm.nih.gov/refseq/assembly_summary_refseq.txt

3. For finding orthologous genes, what we compare is the proteins 
encoded by the annotated genes in one genome, against the 
proteins encoded by the annotated genes in the other. Genes 
producing directly active RNA, such as tRNA and rRNA genes, 
are mostly ignored in these kinds of analyses, perhaps because 
they are fewer than the coding genes, and because comparing 
DNA sequences and thus determining orthology, especially 
among evolutionarily distant organisms, can be very difficult. 

4. In order for BLASTP to run, the protein sequences found in
the files ending with “.faa” (FAA file) have to be formatted into
BLAST databases. I prefer to keep each genome separated so it is
simpler to update results when a new genome is published. The
main caveat to this approach is that some prokaryotic genomes
contain more than one replicon. This means that there will be
more than one FAA file for these genomes. It is better to have all
the protein sequences in a single file. Thus, I concatenate all the
FAA files within the directory of each genome into a single file. A
simple UNIX command that can do this job is:

    cat genome_of_interest/*.faa > FAADB/genome_of_interest.faa

A file compressed with gzip, like the one used under Note 5,
would be obtained as follows:

    cat genome_of_interest/*.faa | gzip -9 > FAADB/genome_of_interest.faa.gz

The “-9” gzip option calls for maximum compression. To
build BLAST databases the command is:

    makeblastdb -dbtype prot -in FAADB/genome_of_interest.faa -parse_seqids \
        -hash_index -out BLASTDB/genome_of_interest -title "genome_of_interest"

5. Given blast’s UNIX heritage, the command can be “piped.”
Because of this important feature, blast results can be
compressed as they are produced, if needed. This can be
accomplished by taking advantage of blast’s default output being the
standard output (the screen), which can be piped into the gzip
command (or bzip2):

    blastp -query genome_of_interest.faa -db informative_genome -evalue 1e-4 \
        -seg yes -soft_masking true -use_sw_tback -outfmt 7 | gzip -9 > \
        genome_of_interest.informative_genome.blastp.gz

Piping can also be advantageous to run blast when the query
fasta file is also compressed:

    gzip -qdc genome_of_interest.faa.gz | blastp -query - \
        -db informative_genome -evalue 1e-4 \
        -seg yes -soft_masking true -use_sw_tback -outfmt 7 | gzip -9 > \
        genome_of_interest.informative_genome.blastp.gz

The “-query -” option is not really necessary (though I prefer
using explicit options, to easily understand what is going on
when checking commands later on), because the default query
is “-” (standard input):

    gzip -qdc genome_of_interest.faa.gz | blastp \
        -db informative_genome -evalue 1e-4 \
        -seg yes -soft_masking true -use_sw_tback -outfmt 7 | gzip -9 > \
        genome_of_interest.informative_genome.blastp.gz

6. It might be tempting to use blastp’s option for displaying only one
matching sequence per query sequence (-max_target_seqs 1).
However, there can be more than one best hit, and the option
would display only one of them. Cases where more than one best hit
exists are not very frequent, but they happen. It is up to the user to decide
whether to use this option and save downstream computation.

7. It is also possible to allow gaps (i.e., intervening genes)
between gene pairs. However, in my experience, allowing
gaps neither improves nor worsens the results. This assessment
is based on knowledge of the operons in Escherichia coli K12.
However, allowing gaps might facilitate calculation of confidence
values in very small genomes, where the number of
same- and opposite-strand genes might be too small. If gaps
are used, it is important that the pairs of genes are part of the
same stretch of genes in the same strand with no intervening
genes in the opposite strand (such stretches are called directons).
For opposite-strand genes it will be enough to confirm
that they are in different strands. The extreme example is the
same-directon versus different-directon approach. The conservation
to be evaluated would be that of two genes in the same
directon, regardless of the number of genes in between. The
control, or negative set, would consist of genes in different, yet
adjacent, directons. This is very similar to a method that is now
used at The Institute for Genomic Research (Maria Ermolaeva,
personal communication), which is a simplified version
of a method published by Ermolaeva et al. [13]. A program
that will output a database of genes in the same directon, and
genes in different directons, would be:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $genome_of_interest = $ARGV[0]
        or die "I need a genome to work with\n\n";
    my $genome_dir = "LOCAL_GENOMES/$genome_of_interest";
    opendir(GNMDIR, "$genome_dir")
        or die "\tNo $genome_of_interest directory\n\n";
    my @ptt_files = grep { /\.ptt$/ } readdir(GNMDIR);
    closedir(GNMDIR);
    my $results_dir = "NEIGHBORS_DIRECTON";
    mkdir($results_dir) unless (-d $results_dir);
    open(NGHTBL, ">", "$results_dir/$genome_of_interest.nghtbl")
        or die "\tCannot write into $results_dir\n\n";
    PTT:
    for my $ptt_file (@ptt_files) {
        # get proper name of the RNT and GBK files
        my $rnt_file = $ptt_file;
        my $gbk_file = $ptt_file;
        $rnt_file =~ s/\.ptt$/.rnt/;
        $gbk_file =~ s/\.ptt$/.gbk/;
        # Is the genome circular?
        # The information is in the first line of the "gbk"
        # file, which starts with the word "LOCUS"
        my $circular = "yes"; # make circular the default
        if (open(GBK, "$genome_dir/$gbk_file")) {
            while (<GBK>) {
                if (/^LOCUS/) {
                    $circular = "no" if (/linear/i);
                    last; # we do not need to read any further
                }
            }
            close(GBK);
        }
        # now we read the table of protein coding genes
        # and their "leftmost" coordinate so we can
        # order them and find the neighbors
        my %strand = ();
        my %coord  = ();
        open(PTT, "$genome_dir/$ptt_file")
            or die "\tCannot open $ptt_file\n\n";
        while (<PTT>) {
            my @data = split;
            next unless (defined($data[1]) && $data[1] =~ /^[+-]$/);
            my $gi = $data[3];
            $strand{$gi} = $data[1];
            my ($coord) = $data[0] =~ /^(\d+)/;
            $coord{$gi} = $coord;
        }
        close(PTT);
        if (-f "$genome_dir/$rnt_file") {
            open(RNT, "$genome_dir/$rnt_file")
                or die "\tCannot open $rnt_file\n\n";
            while (<RNT>) {
                my @data = split;
                next unless (defined($data[1]) && $data[1] =~ /^[+-]$/);
                # The identifier is not a GI
                # but I rather keep the variable names consistent
                # the best identifier for an 'RNA' gene is
                # the gene name (5th column)
                my $gi = $data[4];
                $strand{$gi} = $data[1];
                my ($coord) = $data[0] =~ /^(\d+)/;
                $coord{$gi} = $coord;
            }
            close(RNT);
        }
        # we build directons: stretches of genes in the same
        # strand with no intervening gene in the opposite
        # strand
        my @ids = sort { $coord{$a} <=> $coord{$b} } keys %coord;
        next PTT unless (@ids);
        my @directon = ();
        my $directon;
        my $prev_str = "none";
        for my $gi (@ids) {
            if ($strand{$gi} eq $prev_str) {
                $directon .= "," . $gi;
            }
            else {
                push(@directon, $directon) if (defined $directon);
                $directon = $gi;
            }
            $prev_str = $strand{$gi};
        }

        # with circular genomes we make sure that
        # we close the circle, meaning if first and last
        # directon are in the same strand, they form a single
        # directon
        if (@directon
            && ($circular eq "yes")
            && ($strand{$ids[0]} eq $strand{$ids[$#ids]})) {
            $directon[0] = $directon . "," . $directon[0];
        }
        else {
            push(@directon, $directon);
        }

        # now we do form pairs in same directon, and
        # pairs in different directons
        for my $i (0 .. $#directon) {
            my @gi = split(/,/, $directon[$i]);
            # same directon
            my @expendable = @gi;
            while (defined(my $gi = shift @expendable)) {
                for my $gj (@expendable) {
                    print NGHTBL $gi, "\t", $gj,
                        "\t", $strand{$gi} . $strand{$gj}, "\n";
                }
            }
            ## different directon
            ## assuming circular replicons
            my $next_directon = "none";
            if ($i < $#directon) {
                $next_directon = $directon[$i + 1];
            }
            else {
                # the last directon is followed by the first one only
                # in circular replicons with more than one directon
                if (($circular eq "yes") && ($#directon > 0)) {
                    $next_directon = $directon[0];
                }
                else {
                    next PTT;
                }
            }
            my @gj = split(/,/, $next_directon);
            for my $gi (@gi) {
                for my $gj (@gj) {
                    print NGHTBL $gi, "\t", $gj,
                        "\t", $strand{$gi} . $strand{$gj}, "\n";
                }
            }
        }
    }
    close(NGHTBL);





8. It is important to know that some of the Prokaryotic genomes 
reported so far have more than one replicon, meaning more 
than one DNA molecule. Multireplicon genomes can contain 
two or more chromosomes, mega-plasmids, and plasmids. I 
consider all the published replicons part of the genome, and 
thus the programs presented are designed to read all of the 
replicons under a given genome directory. 

9. As stated, Overbeek et al. [3] noted that some divergently 
transcribed genes could be functionally related, but found 
that the proportion of conserved, divergently transcribed 
genes across evolutionarily distant species was very small. The 
main effect of this possibility is that the confidence value would 
be an underestimate. This is clear in the analyses presented by 
Ermolaeva et al. [13], and in the particular examination of false 
positives presented by Janga et al. [14], who found independent
evidence that almost all of their false positives had a
functional relationship (see also Note 6). In these analyses, the 
confidence value of 0.95 seems to correspond to a positive 
predictive value (true positives divided by the total number of 
predictions) of 0.98. 

10. The relationship between the positive predictive value and the
confidence value has been established [13, 14] using data on
experimentally determined operons of Escherichia coli K12
from RegulonDB [28]. Another useful statistic is coverage
(also called sensitivity: true positives divided by the total number
of truly related pairs). For protein-coding genes, the current
estimate for most genomes is that half of all same-strand
direct neighbors might be in the same operon. In E. coli K12,
the total number of same-strand pairs of neighboring
protein-coding genes is 2930.
Thus, the total number of functionally related neighbors is
approximately 2930/2 = 1465. The maximum number of
predictions for E. coli K12 compared against all the genomes
in the current database is 640 at a confidence value > 0.95.
Thus, the estimated coverage is: 640 * 0.95 / 1465 = 0.41.
This coverage might be thought low, but the predictions are of
excellent quality.
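
The coverage estimate in Note 10 can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic: the function and parameter names are illustrative assumptions, and the confidence value of 0.95 is used, as in the note, as a conservative stand-in for the positive predictive value.

```python
def estimated_coverage(n_predictions, precision, n_same_strand_pairs,
                       operon_fraction=0.5):
    """Estimate coverage (sensitivity) of a set of predictions.

    precision           -- assumed fraction of predictions that are correct
    operon_fraction     -- assumed fraction of same-strand neighbor pairs
                           that are truly in the same operon
    """
    expected_true_positives = n_predictions * precision
    expected_related_pairs = n_same_strand_pairs * operon_fraction
    return expected_true_positives / expected_related_pairs


# Figures quoted in Note 10 for E. coli K12.
coverage = estimated_coverage(n_predictions=640, precision=0.95,
                              n_same_strand_pairs=2930)
print(f"estimated coverage: {coverage:.3f}")
```

With these inputs the ratio is 608/1465, about 0.415, which the note reports as 0.41.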
Chapter 4


Structural and Functional Annotation of Long 
Noncoding RNAs 

Martin A. Smith and John S. Mattick 

Abstract 

Protein-coding RNAs represent only a small fraction of the transcriptional output in higher eukaryotes. The
remaining RNA species encompass a broad range of molecular functions and regulatory roles, a consequence
of the structural polyvalence of RNA polymers. Although several classes of small noncoding RNAs are
relatively well characterized, the accessibility of affordable high-throughput sequencing is generating a
wealth of novel, unannotated transcripts, especially long noncoding RNAs (lncRNAs) that are derived from
genomic regions that are antisense, intronic, intergenic, and overlapping protein-coding loci. Parsing and
characterizing the functions of noncoding RNAs, lncRNAs in particular, is one of the great challenges of
modern genome biology. Here we discuss concepts and computational methods for the identification of
structural domains in lncRNAs from genomic and transcriptomic data. In the first part, we briefly review
how to identify RNA structural motifs in individual lncRNAs. In the second part, we describe how to
leverage the evolutionary dynamics of structured RNAs in a computationally efficient screen to detect
putative functional lncRNA motifs using comparative genomics.

Key words lncRNA, Comparative genomics, RNA secondary structure, Homology search,
Functional genome annotation


1 Introduction 


Functional genome annotation involves the identification of both
known and hypothetical genes in uncharacterized genomic DNA
sequence. This largely includes protein-coding genes and noncoding
RNAs, as well as other genomic features such as telomeric/
subtelomeric regions and centromeres. The identification of
protein-coding genes can unravel the molecular repertoire of the
majority of the genomes of microorganisms, especially prokaryotes,
whose genomes are largely composed of protein-coding sequences.
However, protein-coding sequences encompass only a small fraction
of the genome in higher eukaryotes, which decreases with
increasing developmental and cognitive complexity [1, 2] and
comprises less than 1.5 % of the human genome.

Most of the human genome is dynamically transcribed into
RNA [3, 4], which implies that untranslated RNAs compose the
most abundant class of genomic output. In particular, noncoding
transcripts greater than 200 nt in length, termed long noncoding
RNAs (lncRNAs), are emerging as master regulators of development and
differentiation in higher eukaryotes [5-10]. There are currently
15,767 lncRNA genes (excluding alternative isoforms and pseudogenes)
listed in version 25 of the GENCODE human genome
annotation database, compared to 19,950 protein-coding genes.
The set of protein-coding genes is relatively well characterized
and has remained relatively stable in number and repertoire
throughout metazoan evolution [1, 2, 11], although novel genes,
mainly encoding small proteins, are still being discovered [12].
In contrast, the number of identified lncRNAs is steadily
increasing as more and more biological conditions are investigated
with high-throughput RNA sequencing technologies.

Many lncRNAs appear to regulate gene expression through
their association with epigenetic proteins, such as histone modification
enzymes and DNA methyltransferases, with which they synergistically
organize the nuclear environment [13-15]. Other
lncRNA functions include acting as molecular decoys and macromolecular
scaffolds, as well as the regulation of splicing and translation,
mRNA stability, and the formation of subcellular organelles
[5, 16]. A small but growing number of lncRNAs have been
functionally validated through knockout and ectopic expression
in vivo and in cell culture, and other biochemical studies [17-20],
but the precise molecular mechanisms and structures guiding their
function remain largely unresolved.

At present, lncRNAs are largely categorized by their position
relative to neighboring protein-coding genes, i.e., intergenic, antisense,
intronic, or bidirectional. However, the particular functions
of lncRNAs do not necessarily correlate with their genomic context.
For example, the lncRNA HOTAIR functions by recruiting a
chromatin modification complex (PRC2) to repress gene expression
in trans [21], whereas the lncRNA HOTTIP recruits another
epigenetic complex (WDR5-MLL1) in cis to activate gene expression
via chromosomal looping [22]. Both are situated in the intergenic
regions surrounding HOX genes. The functional annotation
of lncRNAs at a genome- or transcriptome-wide scale therefore
requires the consideration of additional molecular features that
may be unique to each transcript.

A unifying feature of ncRNAs is their propensity to form discrete
secondary and tertiary structures through canonical and
noncanonical nucleotide base pairings that often dictate their function.
Many lncRNAs appear to be very plastic, evolve quickly, and/or
have arisen relatively recently in evolution, as evidenced by high
turnover rates and reduced primary sequence conservation [23,
24], although there are exceptions that have extraordinarily high
levels of sequence conservation [25-27]. Their evolutionary
dynamics are different from those of protein-coding genes, displaying
relaxed structure-function constraints that are consistent with
being under positive selection for adaptive radiation. They are in
general (although there are likely to be exceptions) unlikely to have
catalytic activities such as those of ribosomal RNAs, yet may nonetheless
form evolutionarily stable, functional secondary and tertiary
structures with different functions, as well as shorter primary sequences
that may interact with other RNAs and DNA. For instance, the
widespread presence of repetitive sequences derived from mobile
elements in the human genome is believed to contribute to modular
lncRNA biogenesis by forming a reservoir of functional motifs
(or structured templates for RNA-binding proteins) that can be
co-opted into RNA regulatory networks via positive selection [28-30].

Computational identification of functional RNA structural
motifs encoded in genomic sequences is a challenging task, mainly
because almost any RNA sequence can form internal base pairs via
classical Watson-Crick, Hoogsteen, or ribose 2'OH hydrogen bond
formation, and fold into discrete structures [31, 32], but also
because RNA structures themselves are dynamic, flexible, and
contingent on the cellular environment (e.g., temperature, ion
concentrations, ligand binding, transcriptional kinetics). Functional
RNA structures can nonetheless be identified through comparative
genomics by observing nucleotide substitutions that are consistent
and compatible with a common structural topology. Indeed, a
much larger fraction of the human genome seems to function
through the formation of RNA structure motifs than through
sequence-constrained elements, as evidenced by considering nucleotide
covariation events in evolutionary information [33].

In this chapter, we describe how to annotate ncRNAs in genomic
or transcriptomic data, where known or putative functions are
assigned to uncharacterized sequences to gain insight into their
biology. First, we summarize how to identify functional RNA elements
in single sequences via homology search as well as prediction
of local structures in long transcripts. Finally, we describe how to
identify putative functional motifs in lncRNAs that are supported
by evolutionarily conserved RNA secondary structures. We provide
user-friendly, step-by-step instructions on how to perform a multiple
genome-wide screen for functional RNA motifs similar to that
published in [33].


2 Materials 


A UNIX-based computing environment should be employed for
most of the described methods, preferably with access to a
high-performance computing infrastructure. Alternatively, a computer
or server with multiple processors and over 4 GB of RAM may be
employed.

2.1 Genomic Data

Genomic or transcriptomic sequence data should be downloaded
and converted (if required) to fasta file format, unless it is already
available in that format. Genomic data for reference organisms can
be obtained from the following sources:

1. UCSC genome browser: select the organism and the desired
genome version, then full data set, then the file with suffix
“fa.gz” at http://hgdownload.cse.ucsc.edu/downloads.html.

2. NCBI: select the species of interest and then sequence data
can be downloaded for each chromosome individually at
ftp://ftp.ncbi.nih.gov/genomes/. An FTP batch download
tool or interface should be considered to automate the process.

3. ENSEMBL genome browser: select the appropriate release
version, then ‘fasta’ at ftp://ftp.ensembl.org/pub/.

2.2 Transcriptomic Data

LncRNAs are often spliced (including alternatively spliced), generating
sequences and structures that would otherwise be missed
during computational screens of unprocessed genomic sequences.
Depending on the task at hand and the availability of suitable data,
the sequences corresponding to processed transcripts should also
be considered to improve the robustness of functional lncRNA
annotation. For RNA sequencing data, algorithms for de novo
assembly should be considered provided the depth of coverage is
sufficient. These programs usually produce output files containing
genomic coordinates in .bed (browser extensible data file, preferably
in 12-field format with exon boundary information), .gtf (gene
transfer format), .gff (general feature format), or similar formats.
The popular Cufflinks program from the Tuxedo suite of RNAseq
tools [34] produces a .gtf file and includes the appropriate
software (a program called gffread located in the Cufflinks binary
folder) to extract and process sequence information from a reference
genome into a fasta file. Alternatively, the Trinity program for
de novo transcriptome assembly without aligning to a reference
genome [35] directly outputs a fasta file of assembled transcripts
from the fastq files containing deep sequencing data.

2.3 Multiple Genome Alignments

Comparative genomics approaches for functional annotation of
noncoding RNAs require pairwise or multiple genome alignments
for the species of interest. Prealigned genomic sequence alignments
for most well-studied vertebrates can be downloaded in .maf
(multiple alignment format) from the ENSEMBL comparative genomics
database [36] or from the UCSC genome browser [37],
which also hosts alignments for nonvertebrate species, as follows:

1. ENSEMBL Compara: Information about downloading multiple
genome alignments is available at http://ensembl.org/
info/data/ftp/index.html. Multiple alignments in .maf from
the latest release at the time this was written can be downloaded
via FTP protocol at ftp://ftp.ensembl.org/pub/release-85/
maf/ensembl-compara/multiple_alignments/.

2. UCSC Genome Browser: Navigate to the table browser tab at
http://genome.ucsc.edu (select ‘Tools,’ then ‘Table browser’
from the drop-down menu bar on the top of the page). Select
the reference species of interest, then ‘Comparative Genomics’
from the group menu, ‘Conservation’ from the track menu,
and ‘Multiz Align’ from the table menu. Optionally, regions
can be limited to an existing UCSC or custom track (which
needs to be uploaded independently prior to this step). This
can significantly reduce the size of the download when only
a set of transcripts is of interest, for example. Next, ensure that
‘MAF - multiple alignment format’ appears in the output format
menu, otherwise the appropriate track or table must be
selected. Finally, name the output file and get output (ideally,
compressed) or send the output to the Galaxy [38] platform for
post-processing (see later).

Multiple alignments from the UCSC Genome Browser employ
different synteny and alignment algorithms than those from
ENSEMBL. The latter usually present contiguous alignments for
large syntenic blocks via the Enredo (or Mercator) and Pecan algorithms
[39, 40], whereas the former are optimized for total genomic
coverage and present smaller, fragmented alignment blocks as produced
with the TBA and MULTIZ algorithms [41]. Because of their
highly fragmented nature and variable presence of each species in
each block, TBA/MULTIZ alignments may require additional processing,
such as being ‘stitched’ together. A good summary of
approaches for processing .maf files is described by Blankenberg
et al. [42]. The ENSEMBL alignments require less processing, as
the syntenic blocks are much longer. These alignments can also
contain segmental duplications, which should be removed at the
user’s discretion (ensuring that the coordinates of the segmental
duplications for the reference species are saved for future reference).
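
To make the notion of fragmented blocks with a variable set of species more concrete, the following Python sketch reads alignment blocks from .maf text and reports which sources are present in each block. It is a minimal illustration, not a replacement for the processing tools summarized in [42]: the function name is an assumption, only the ‘a’ (block start) and ‘s’ (sequence) lines of the format are handled, and the two toy blocks are made up for the example.

```python
def parse_maf_blocks(lines):
    """Yield one dict per alignment block: source -> (start, size, strand, text)."""
    block = {}
    for line in lines:
        line = line.strip()
        if line == "a" or line.startswith("a "):   # 'a' lines open a new block
            if block:
                yield block
            block = {}
        elif line.startswith("s "):                # 's' lines hold aligned sequences
            _, src, start, size, strand, _src_size, text = line.split()
            block[src] = (int(start), int(size), strand, text)
    if block:
        yield block


# Toy input (invented coordinates and sequences, for illustration only).
maf_text = """##maf version=1
a score=100.0
s hg19.chr1 100 8 + 1000 ACGTACGT
s mm10.chr4 200 7 + 2000 ACGTAC-T

a score=50.0
s hg19.chr1 108 5 + 1000 GGCAT
s canFam3.chr2 50 5 - 3000 GGTAT
"""

blocks = list(parse_maf_blocks(maf_text.splitlines()))
for number, block in enumerate(blocks, start=1):
    species = sorted(src.split(".")[0] for src in block)
    print(f"block {number}: {', '.join(species)}")
```

In this toy input the two blocks are adjacent in the reference (hg19) but contain different non-reference species, which is the situation that makes stitching of TBA/MULTIZ blocks necessary.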


3 Methods 


The first step in any analysis of a putative noncoding RNA is to
estimate its protein-coding potential. This typically involves excluding
known protein-coding genes identified from a reference genome
annotation or from mass spectrometry data (when available), as well as
computational estimation of coding potential via the analysis of
open reading frames and evolutionary information, such as synonymous
codon usage. The Pinstripe software suite is one example of
a recently developed bioinformatics resource that enables the
discrimination of coding versus noncoding transcripts, and it is
accompanied by a well-described usage manual [12]. Such methods
and additional considerations (e.g., bifunctional RNA transcripts
that are both mRNAs and ncRNAs) are reviewed in [43, 44].
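
As a rough illustration of the open reading frame analysis mentioned above, the following Python sketch reports the longest ATG-to-stop ORF in the three forward frames of a transcript. It is a crude first-pass filter under stated assumptions (forward strand only, standard stop codons, no evolutionary information), not the method implemented in Pinstripe or any other dedicated tool; the function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, n_codons) of the longest ATG..stop ORF.

    Coordinates are 0-based, end-exclusive and include the stop codon;
    n_codons excludes the stop codon. Returns (0, 0, 0) if no ORF is found.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0, 0)
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons > best[2]:
                    best = (start, pos + 3, n_codons)
                start = None                     # look for the next ORF
    return best


# Hypothetical transcript fragment used only to exercise the function.
example = "CCATGAAATTTGGGTAACC"
start, end, n_codons = longest_orf(example)
print(start, end, n_codons, example[start:end])
```

Transcripts whose longest ORF covers a large part of their length deserve closer inspection before being annotated as noncoding; the length cutoff to apply is left to the user.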

There are two general approaches for the functional annotation
of noncoding RNAs: (1) homology search against known RNAs;
and (2) de novo identification of putative functional domains. The
former is more suitable for the annotation of small RNAs (e.g.,
tRNAs, snoRNAs, 5S rRNAs, snRNAs, miRNAs); however, an
increasing number of lncRNAs have been sufficiently characterized
and are amenable to this approach (see [45] and the most recent
release of RFAM). De novo computational annotation of noncoding
RNAs can be applied to both size categories of transcripts and
involves the elucidation of both sequence and structural characteristics
that are indicative of function. Comparison of sequence similarity
to orthologous genes, for instance, with BLAST [46], is a
commonly employed method for the identification of protein-coding
genes and ribosomal RNAs given their strong dependence
on sequence composition as well as crucial cellular functions. However,
when comparing genes with similar functions across larger
evolutionary distances, sequence homology is outclassed by structural
homology, where classical sequence alignment methods are
inefficient. Hidden Markov models [47, 48] and amino acid substitution
matrices (e.g., PAM [49] or BLOSUM [50]) are employed to
overcome the sequence alignment barrier for protein-coding genes
that show greater sequence divergence.

For noncoding RNAs, alternative computational strategies
must be employed to overcome the increased diversity of sequences
that are compatible with a given secondary or tertiary structure.
The evolutionary dynamics of noncoding RNAs are governed by
three factors: (1) They do not require the preservation of sequence
composition to convey a genetic code, i.e., codons, with the notable
exception of the anticodon loop in tRNAs. (2) RNA structures
are more tolerant to nucleotide substitutions than proteins are
to mutated codons. Indeed, 6 out of the 16 possible ribonucleotide
pair combinations will form canonical base pairings, which include
Watson-Crick and G-U/U-G ‘wobble’ base pairs. Because RNA
structures can accommodate a higher frequency of base substitutions
than mRNAs (as long as they are consistent or compatible
with their paired nucleotide), bioinformatics tools investigating
noncoding RNAs must focus on secondary and tertiary structural
characteristics as well as primary sequence, where short patches of
high conservation may indicate important biochemical interactions.
(3) Since their biological function is often of a regulatory
nature, they are more likely to be under positive selection for
adaptive radiation. This is most notable for lncRNAs.
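
The count of 6 canonical pairings out of 16 ordered nucleotide combinations, and the tolerance to substitutions that follows from it, can be checked with a short Python sketch. The helper name `consistent_mutations` is an illustrative assumption; it lists the single-nucleotide changes on either side of a pair that still leave a canonical pair.

```python
from itertools import product

# Watson-Crick pairs plus the G-U/U-G wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}

all_pairs = list(product("ACGU", repeat=2))
pairing = [pair for pair in all_pairs if pair in CANONICAL]
print(len(pairing), "of", len(all_pairs), "combinations can pair")


def consistent_mutations(pair):
    """Single-nucleotide substitutions that keep the pair canonical."""
    five, three = pair
    kept = []
    for base in "ACGU":
        if base != five and (base, three) in CANONICAL:
            kept.append((base, three))
        if base != three and (five, base) in CANONICAL:
            kept.append((five, base))
    return kept


for pair in pairing:
    print(pair, "->", consistent_mutations(pair))
```

For example, a G-C pair tolerates a C to U change (giving G-U), and a G-U pair tolerates changes on either side (giving A-U or G-C), so a helix can drift in sequence through such intermediates without losing its pairing.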

3.1 Detecting Homology to Known Functional RNAs


The RFAM database encompasses several well-characterized noncoding
RNA families that are presented in multiple alignments
based on both their sequence and higher order structure topologies
[51]. Until recently, the RFAM repository was mostly limited
to entire RNA sequences, mainly small noncoding RNAs. Recent
updates to RFAM have expanded the repository to include some
lncRNAs as well as bona fide RNA structural motifs [52]. The
latter are defined as “a non-trivial, recurring RNA sequence
and/or secondary structure that can be predominantly described
by local sequence and secondary structure elements” and can be
part of a larger structure or noncoding RNA [53]. RFAM includes
Covariance Models (CMs) for each entry, or family, in the database.
CMs are a probabilistic representation of RNA structure
profiles that can be used to scan a genome (or transcriptome) for
sequences compatible with a given consensus structure. They can
be used by the Infernal program to scan large metazoan genomes
in minutes and report homologous hits with high accuracy [54].
The Infernal software package can also generate a CM from a
given multiple sequence and structure alignment and thus permits
using custom CMs to perform a search. Detailed instructions on
how to use Infernal can be found at http://infernal.janelia.org/
as well as in [55].

There are also alternative bioinformatics resources for RNA
structural homology search. The RNAmotif program enables
users to construct descriptors of a target RNA structure, then
scans a sequence database, and reports all compatible sequences
[56]. Although the software is somewhat out of date, RNAmotif’s
capacity to construct detailed and customized RNA structure
descriptors manually and with relative ease justifies its pertinence. It
also enables the inclusion of tertiary structural elements such as
pseudoknots, triplexes, and quadruplexes. Unfortunately, it does
not consider thermodynamic stability or base-pairing probabilities
and, consequently, can produce a large number of biologically
irrelevant hits unless the results are filtered appropriately (for a
practical example of how this may be performed, please refer to
the last paragraph of Subheading 3). Alternatively, the recently
developed LocARNAscan algorithm [57] can consider the local
structural environment in the target sequence when performing a
scan using a base pair probability matrix (see later) as a query, which
can be generated from a single sequence or an alignment of several
sequences.

3.2 Predicting the Structural Landscape of Individual lncRNA Sequences

The computational prediction of RNA secondary structures from
sequence alone was one of the first challenges in bioinformatics.
Consequently, modern software packages such as RNAfold [58],
UNAfold [59], and RNAstructure [60, 61] are quite efficient at
predicting the most thermodynamically stable RNA secondary
structure, the Minimum Free Energy (MFE) structure, for a given
input sequence. Unfortunately, MFE structural predictions do not always
represent the biological reality and, on their own, are not usually
considered as a robust qualification of function. This is particularly
true for lncRNAs, which can be tens of thousands of nucleotides
long. Locally stable RNA secondary structures, which might compose
functional units (or modules) of a lncRNA, can be overlooked
in favor of long-range base pairings that contribute more toward
lowering the overall free energy score. Furthermore, the dynamic
structural nature of RNA macromolecules also confounds RNA
structure prediction, as noncoding RNAs can form more than a
single functional structural topology (riboswitches are a good
example). It is therefore beneficial to consider an ensemble of
suboptimal structures when characterizing the function of noncoding
RNAs, as exemplified in Fig. 1.

A more biologically relevant alternative to the MFE structure is
the centroid, which consists of the structure with minimal distance
to all other structures in a set of suboptimal structures. The centroid
is usually generated through the partition function, which
estimates the statistical distribution of all possible RNA structures
within a given thermodynamic range (Boltzmann ensemble).
Although centroid estimators have been shown to outperform
MFE predictions on known RNA structures [62], they do not
necessarily inform about the stability or diversity of the structural
landscape for a given query sequence. The latter can be evaluated in
two ways: (1) through direct visual inspection of a base-pairing
probability matrix, such as that produced by the “RNAfold -p”
program in the Vienna RNA package (Fig. 1a), where a greater quantity
of smaller dots is indicative of a larger diversity of compatible base
pairings for a particular nucleotide, which is consistent with a
reduced likelihood of forming a stable structure; and (2) through
the command-line output of RNAfold, or the RNAfold Webserver
[63], which produce a numerical estimate of the ensemble diversity,
as well as the frequency of the MFE within the ensemble (i.e., how
credible the MFE structure prediction is). A larger ensemble diversity
value suggests that the queried RNA sequence may form a
broader repertoire of structures or dynamically fluctuate between
intermediary structures.
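
The relationship between the MFE structure, the centroid, and the ensemble diversity can be illustrated on a toy ensemble. The Python sketch below is not the algorithm used by RNAfold, which sums over all possible structures via the partition function; here the "ensemble" is just three hand-picked structures with invented free energies, and all names are assumptions made for the example. The diversity is computed as the expected base-pair distance between two structures drawn from the ensemble, which is analogous in spirit to the ensemble diversity discussed above.

```python
import math

KT = 0.0019872 * 310.15   # gas constant (kcal/mol/K) times 37 degrees C in kelvin


def base_pairs(dot_bracket):
    """Return the set of (i, j) pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(j)
        elif char == ")":
            pairs.add((stack.pop(), j))
    return pairs


def to_dot_bracket(pairs, length):
    chars = ["."] * length
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)


def ensemble_summary(structures):
    """structures: list of (dot_bracket, free_energy_in_kcal_per_mol)."""
    weights = [math.exp(-energy / KT) for _, energy in structures]
    z = sum(weights)                    # partition function of the toy ensemble
    probs = [w / z for w in weights]

    pair_prob = {}                      # base-pairing probability "matrix"
    for (structure, _), p in zip(structures, probs):
        for pair in base_pairs(structure):
            pair_prob[pair] = pair_prob.get(pair, 0.0) + p

    mfe = min(range(len(structures)), key=lambda k: structures[k][1])
    # The structure with minimal expected base-pair distance to the ensemble
    # is made of the pairs whose probability exceeds 0.5.
    centroid = {pair for pair, p in pair_prob.items() if p > 0.5}
    diversity = 2 * sum(p * (1 - p) for p in pair_prob.values())
    return {
        "mfe": structures[mfe][0],
        "mfe_frequency": probs[mfe],
        "centroid": to_dot_bracket(centroid, len(structures[0][0])),
        "diversity": diversity,
    }


# Invented structures and energies, for illustration only.
toy_ensemble = [
    ("(((.....)))...", -3.0),
    ("...(((....))).", -2.8),
    ("....((....))..", -2.7),
]
summary = ensemble_summary(toy_ensemble)
for key, value in summary.items():
    print(key, value)
```

In this toy case the MFE structure accounts for less than half of the ensemble, while the two alternative structures share base pairs whose combined probability exceeds 0.5, so the centroid differs from the MFE prediction.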

As mentioned earlier, secondary structure prediction of individual
lncRNA sequences is not a trivial task. Fortunately, the
computational prediction of locally stable structural elements has
been shown to be more accurate than global RNA structural predictions
for long RNA polymers [64]. This finding is consistent
with the general hypothesis that lncRNAs function via local structural
(or unstructured) domains, such as protein-binding motifs or
RNA-DNA interactions (see Subheading 1). RNAplfold from the
Vienna RNA package [58] and its enhancement in LocalFold [64]
both offer a useful solution for the manual inspection of local
structural topologies in long noncoding RNAs. The tools produce
a base-pairing probability matrix that spans the entire RNA
sequence but limits the range of base-pairing interactions to a
user-definable threshold (Fig. 1d). This facilitates the identification
of locally stable (or unstable) structures, which can reveal putative
functional regions as well as guide the design of small interfering
RNAs for knockdown experiments. Alternatively, there are software
tools, such as Rnall [65], RNAsurface [66], and RNALfoldz (part of
the Vienna RNA package [58]), that can facilitate the identification
of RNA subsequences presenting strong local structural stability,
although a user-defined maximal base-pairing span is required.

Fig. 1 Representation of RNA secondary structure predictions for single sequences. (a) An RNA base-pairing
probability matrix representing both the minimum free energy structure prediction (below the diagonal) and
suboptimal base-pairing probabilities (above the diagonal) of a serine tRNA that forms five helices. The RNA
sequence of interest is displayed on the X and Y axes, where each dot represents possible base pairings
between bases (x, y). The size of the dots is indicative of the frequency (or probability) of the base pairings in a
Boltzmann ensemble of suboptimal structures, as calculated by McCaskill’s partition function algorithm in the
Vienna RNA package [58]. The base pairs forming the validated biological structure (b) are highlighted in blue
and numbered accordingly, whereas the unpaired bases forming the anticodon are highlighted in green. (c)
The MFE prediction forms a structure that is quite divergent from the actual tRNA, although the biological
structure is perceptible in the suboptimal base pairings. (d) A base-pairing probability matrix generated by the
RNAplfold algorithm on a ~400 nt section of the 3' end of the NEAT1 lncRNA. Locally stable base pairings are
displayed as described for (a); however, the sequence is represented on the diagonal (i.e., the upper quadrant
of (b) is rotated 45°). In the lower left, the bases associated to the base pairs (dots) are highlighted in blue. In
the lower right, the tRNA-like structure at the 3' end of NEAT1 (as illustrated in Fig. 2c) is highlighted in red

3.3 Inferring Function from an Individual RNA Sequence

If noncoding RNAs function through the formation of stable secondary
structures, can structure predictions alone be used for de
novo functional annotation of ncRNAs? This question was first
examined over 30 years ago by comparing the RNA structure (or
‘folding’) score of a native RNA sequence to that of shuffled
sequences, under the premise that functional RNAs should form
more stable structures than random sequences [67-69]. This strategy
produced promising results, but it was subsequently shown
that the relatively higher stability of native noncoding RNA
sequences reflected local biases in sequence composition rather
than structural features alone [70]. In particular, the energetic
contributions of base-stacking interactions were ignored (the
order of consecutively arranged base pairs can significantly alter
the free energy score). Some reports have since successfully applied
this approach to certain classes of noncoding RNAs by using adequate
background models that control for dinucleotide content
[71, 72]. Known and novel RNA elements have also been predicted
in the yeast genome using a similar strategy, several of which were
subsequently experimentally validated [73].

3.4 Detecting Functional 2D Motifs via Comparative Genomics

The biological significance of lncRNAs has often been questioned
since they (generally) display lower conservation of primary
sequence than proteins in evolutionary comparisons [24, 74].
Conservation of RNA secondary or tertiary structure has rarely
been considered in such analyses, partially due to the more complex
bioinformatic analyses required to investigate such phenomena.
However, probing evolutionary data for evidence of RNA structural
conservation is not substantially more difficult in practice than
evaluating primary sequence conservation. In this section, we
describe how to leverage the hallmark signature of RNA structural
conservation, i.e., base pair covariation, to identify putative functional
RNA motifs in multiple sequence alignments, using existing
software.
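
Base pair covariation can be illustrated with a short Python sketch that inspects the two alignment columns of a proposed base pair and reports whether the observed substitutions preserve pairing. This is a toy classification under stated assumptions, not the statistics computed by dedicated ECS prediction software: the function name is hypothetical, the alignment is invented, and any pair containing a gap or a non-canonical combination is simply counted as incompatible.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def classify_pair_column(alignment, i, j):
    """Summarize the evidence for a base pair between columns i and j."""
    observed = [seq[i] + seq[j] for seq in alignment]
    compatible = {pair for pair in observed if pair in CANONICAL}
    incompatible = sum(pair not in CANONICAL for pair in observed)
    # Covariation: two pair types that differ at BOTH positions (e.g., AU vs GC).
    covariation = any(a[0] != b[0] and a[1] != b[1]
                      for a in compatible for b in compatible)
    # Consistent mutation: two pair types that differ at exactly one position.
    consistent = any((a[0] != b[0]) != (a[1] != b[1])
                     for a in compatible for b in compatible)
    return {"pair_types": sorted(compatible), "covariation": covariation,
            "consistent": consistent, "incompatible": incompatible}


# Invented alignment of a 10-nt hairpin with consensus structure (((....))).
alignment = [
    "GCAGAAAUGC",
    "GCGGAAACGC",
    "GUAGAAAUGC",
    "GCAGAAAUGA",
]
consensus_pairs = [(0, 9), (1, 8), (2, 7)]

report = {pair: classify_pair_column(alignment, *pair) for pair in consensus_pairs}
for pair, result in report.items():
    print(pair, result)
```

In this toy alignment, the innermost pair (2, 7) switches between A-U and G-C (covariation), pair (1, 8) switches between C-G and U-G (a consistent mutation), and pair (0, 9) is broken in one sequence. Columns of the first kind provide the strongest evidence that a structure, rather than a sequence, is being conserved.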

We recently showed that measuring RNA structure conservation
from genomic sequence alignments of 32 mammals could
identify evidence of purifying selection on RNA structure motifs
that span over 13 % of the human genome, while presenting little
overlap with known sequence-constrained regions [33]. Evolutionarily
Conserved Structure (ECS) predictions with the human
genome as reference can be visualized in the UCSC genome
browser (Fig. 2) as follows:
Fig. 2 Visualization of ECS predictions in the UCSC Genome Browser. (a) The NEAT1 lncRNA locus presenting
several ECS predictions from [33]. Six subtracks are displayed: SISSIz, SISSIz with RIBOSUM scoring, and
RNAz-derived results for all significant predictions and those with structure topologies and alignments
available to view on a web server (see Subheading 3). (b) Expanded, zoomed in view of the tracks with
structure representations. The RNA secondary structure consensus, flanked by the outermost base pair, is
represented by a thicker rectangle. The color of the bars corresponds to a relative measure of their scores
(darker = stronger score). (c) Detailed illustration of a segment of the predicted structure and alignment
obtained by clicking on an ECS prediction from (b), which also provides general prediction statistics, a
dot-bracket representation of the consensus structure and the consensus sequence generated on the spot via the
Vienna RNA package [58]


1. Browse to http://genome.ucsc.edu (or any UCSC Genome
Browser mirror), navigate to the ‘Genomes’ tab, then select
the hg19 human genome assembly.

2. Click on the ‘track hubs’ button, then select the ‘My hubs’ tab.

3. Paste in the URL for the ECS track hub
(http://www.martinalexandersmith.com/hubs/ecs/hub.txt), then ‘Add Hub.’
The URL can also be obtained via the supplementary information
from [33].

4. Browse to any region of interest, zooming out if the ECS track
hub titles appear and nothing is displayed under them in the
browser (usually, >1 kb of genomic span should be sufficient).
ECS predictions are split according to the algorithms that were
used to make the predictions (RNAz, SISSIz, and SISSIz +
RIBOSUM). Although all the ECS predictions are statistically
significant (with a <1 % false-positive rate), they are color
coded based on their relative scores (darker = less likely to
arise by chance). After fully expanding the tracks, either by
clicking on the title of the track or in the individual track
configuration below the browser, the scores associated to the
predictions are displayed as the name of each ECS prediction.
SISSIz-derived predictions will display Z-scores, which represent
the degree of observed structural conservation (in number
of standard deviations) from the mean of a background distribution
produced from SISSIz’s null model. There are two
subtracks for each employed algorithm: one supporting structure
representations, one without. Those with structure representation
also have larger segments annotated within individual
ECSs; these correspond to the positions within the sampled
genomic alignments that contain the outermost base pairs
forming the conserved structure prediction (Fig. 2).

5. Expand the ECS track display settings to ‘pack’ or ‘full’ view by 
clicking on the title bar or by selecting the appropriate view in 
the drop-down menu below the browser interface window. 

6. Directly click on a bar corresponding to an ECS prediction of
interest. Depending on the nature of the subtrack, this will
either: (1) link to a page with a rundown of the statistics for
the ECS of interest as well as a description of the methodology;
or (2) link to an external page with detailed statistics for the
selected ECS, a colored and annotated figure of the corresponding
consensus secondary structure, the multiple sequence
alignment (colored and annotated) that was used to make the
prediction, as well as the consensus structure and sequence in
dot-bracket format (Fig. 1c). The ECS tracks with structure
representations that link to an external page (as described
earlier) will display bars with thin and thick segments; the
thinner extremities correspond to regions in the sampled alignment
that are not contained within the predicted secondary
structure, whereas the thicker internal portion of the bars
represents regions contained within the ECS prediction (see
Note 1).

7. Any combination of subtracks (i.e., all ECS predictions, predictions
with structure representations, or the results for individual
algorithms) can be hidden (or redisplayed) by clicking
on the link in the title of the ECS predictions track, located in
the drop-down controls section of the UCSC browser below
the main window.
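
The first three steps can also be condensed into a single link. The following is a minimal sketch, assuming that the browser's hgTracks page accepts the `db`, `hubUrl`, and `position` URL parameters; the helper name `ucsc_hub_url` and the example region are purely illustrative and are not taken from [33].

```python
from urllib.parse import urlencode

# Hub URL as given in step 3.
ECS_HUB = "http://www.martinalexandersmith.com/hubs/ecs/hub.txt"


def ucsc_hub_url(hub_url, db="hg19", position=None,
                 base="http://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Build a browser link that opens assembly `db` with a track hub attached.

    `base` can be pointed at any UCSC Genome Browser mirror.
    """
    params = {"db": db, "hubUrl": hub_url}
    if position is not None:
        params["position"] = position  # e.g. "chr11:65190000-65215000"
    return base + "?" + urlencode(params)


# Illustrative region only; replace with the locus of interest.
url = ucsc_hub_url(ECS_HUB, position="chr11:65190000-65215000")
print(url)
```

Opening the printed link in a web browser should be equivalent to steps 1-3, provided the chosen mirror supports track hubs.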

There are several caveats pertaining to the data currently
contained within the ECS track hub for the UCSC browser.
These data are derived from genome-wide screens that are resource
intensive and, consequently, were applied to heuristic and not
necessarily accurate genome-scale multiple sequence alignments
(alignment errors can often be observed via close inspection of
alignments from step 6). The quality and number of significant
ECS predictions will undoubtedly improve by realigning the queried
sequences with more robust algorithms, such as Clustal Omega
[75], MAFFT [76] or, ideally, RNA structure alignment algorithms
(reviewed in [77]).

Another caveat is that the above-mentioned ECS predictions
are generated from sliding windows of <200 nucleotides (nt),
which include multiple genome alignment columns that can primarily
be composed of indels. This means RNA base pairs that are
more than 200 nt apart are ignored. Furthermore, the sampled
alignment windows are offset by 100 nt; conserved RNA
structures smaller than 200 nt may therefore also be missed given an
incomplete sampling of the structure’s boundaries.
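
The effect of the window size and offset can be checked with a few lines of code. The sketch below is an illustration only, not part of the published pipeline: it assumes half-open windows of 200 alignment columns offset by 100 columns, starting at the first column of a block, and the function names are hypothetical.

```python
def windows(block_len, size=200, step=100):
    """Yield half-open (start, end) windows that fit entirely inside a block."""
    for start in range(0, block_len - size + 1, step):
        yield start, start + size


def is_sampled(struct_start, struct_end, block_len, size=200, step=100):
    """Return True if at least one window contains the whole structure."""
    return any(s <= struct_start and struct_end <= e
               for s, e in windows(block_len, size, step))


block = 1000  # length of an alignment block, in columns
print(list(windows(block))[:3])     # first three sampled windows
print(is_sampled(110, 260, block))  # 150-column structure inside window (100, 300)
print(is_sampled(60, 210, block))   # same length, but straddles two window edges
print(is_sampled(100, 350, block))  # 250-column span exceeds the window size
```

Under these assumptions, any structure spanning 100 columns or fewer is always contained in at least one window, whereas structures spanning 101-200 columns are captured only when they happen to fall favorably relative to the window boundaries.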

An additional issue with the functional annotation of lncRNAs
is that many are spliced, often comprising relatively small exons.
Although the biological motives for lncRNA splicing remain enigmatic,
one possibility is that constitutively spliced exons are joined
to maintain the formation of higher order structures, whereas
alternatively spliced exons contain self-contained modular units.
Probing multiple alignments for evidence of RNA structural conservation
in spliced transcripts would thus require pasting the
alignment blocks together first (reviewed in [42]), as well as additional
considerations like splice site conservation and syntenic continuity
in other species.

Performing a de novo scan for ECSs in multiple sequence
alignments, either from another reference species or from a set of
spliced alignments, can be quite computationally intensive. The
approach used for the genomic screen published in [33] can nonetheless
be performed by anyone with basic command-line experience.
For large alignments (whole genomes or chromosomes):

1. Download and install the following software packages (requires
compilation and linking the binaries to the environmental
$PATH variable):

(a) SISSIz 2.0 and RNAz 2.0 [78] available at
http://martinalexandersmith.com/ecs or via links provided in their
original manuscripts (N.B. SISSIz 2.0 was released in
[33]).

(b) The Vienna RNA package at http://www.tbi.univie.ac.at/RNA
[58], preferably version 1.8.5 (newer versions may
not be compatible with the software in step 2).

2. Download the JAVA archive containing the binary code
required to scan .maf files from the following URL (in the
software section): http://martinalexandersmith.com/ecs.

3. Ensure that the multiple (genome) sequence alignments have 
the reference species in the first row with genomic coordinates 
in the appropriate field of the .maf file. This will be used to 
output the genomic coordinates of the predictions during the 
scan. Also, ensure that the alignments present sufficiently long 
blocks (see Subheading 2 and Note 2). 

4. Launching the following command (in the appropriate directory)
from a UNIX terminal will provide more verbose information
on the basic usage and available parameters: java -jar
MafScanCcr.jar. Some options include window size, step or
‘sliding’ distance, realignment of the input with the multiple
sequence alignment program MAFFT, number of processors to
use, etc.

5. Execute the program with the selected parameters. The program
will load one alignment block of the .maf input file at a
time, with an optional realignment step to increase accuracy at
the expense of computation time. Next, N windows are sampled
concurrently, where N is the number of specified processors
(the alignments can also be run in parallel on a computer
cluster).

6. The program will save all sampled subalignments that score
above the respective thresholds for each employed algorithm.
Genomic coordinates associated with significant ECS predictions
for the alignment’s reference species are also emitted to the
standard output in browser extensible data (.bed) format. Simply
redirect the standard output to a file, e.g., ‘> output.bed’ from
the UNIX terminal. Alternatively, genomic coordinates can be
recovered from the file names of the saved alignments, which
encode a 6-field underscore-delimited bed-compatible entry.
Furthermore, the name field of the .bed entries also encodes
colon-delimited statistical information about the alignment
used to make the ECS prediction. This includes (in order):

(a) Number of retained sequences. 

(b) Raw mean pairwise identity (including indels). 

(c) Mean pairwise identity (normalized to the shortest gapless 
sequence length). 

(d) Relative gap (indel) content. 

(e) Standard deviation of the (normalized) mean pairwise 
identity. 

(f) Normalized Shannon entropy. 

(g) Relative GC content.

(h) Scoring algorithm employed: s = SISSIz 2.0; r = SISSIz
2.0 with RIBOSUM scoring; z = RNAz 2.0.

The fifth field of the .bed entries represents the score associated
with the predictions. The scores have been modified to accommodate
representation in the UCSC genome browser, which
only supports integer values. Z-scores from SISSIz predictions
are multiplied by -100 (-2.54 becomes 254), whereas RNAz-derived
scores are simply multiplied by 100 (0.85 becomes 85).

7. The topology of a given ECS prediction can be visualized by
running the RNAalifold program from the Vienna RNA package
on the multiple alignment associated with the predicted ECS.
The default RNAalifold options are suitable for ECS predictions
from SISSIz and RNAz, but the RIBOSUM scoring
option ‘-r’ should be used otherwise.

8. Because the ECS predictions are based on a consensus, it is
possible that the reference species forms a structure that is not
compatible with the consensus. To evaluate the likelihood of
this structural congruence, an auxiliary program is available to
process the alignments output from step 6 (see the supplementary
information of [33]). The ParseAlifold.jar program performs
two main tasks: (1) trimming the genomic coordinates of
the reference species to the outermost base pairs of the consensus
structure; (2) measuring the relative difference between the
native secondary structure for the sampled reference sequence
and that produced from constraining the structure to the consensus,
as produced from the ‘RNAfold -C’ command from
the Vienna RNA package [58]. This is done for both the
minimum free energy and the base-pairing probabilities generated
from the partition function implemented in RNAfold,
where the probabilities of base pairs from the consensus are
extracted from the base-pairing probability matrix. The .bed 6
plus formatted output prints to the terminal’s standard output
and contains the following additional fields:

(a) Average base-pairing probability of the minimum free
energy structure for the reference species. If the base is
unpaired, this value is calculated as 1 minus the sum of all
probabilities for the given base.

(b) Average base-pairing probability of consensus-constrained
reference structure.

(c) Base-pairing probability ratio (constrained/native).

(d) Free energy (kcal/mol) of the consensus-constrained reference
sequence.

(e) Minimum free energy (kcal/mol) of the native reference
sequence.

(f) Free energy ratio (constrained/native). 

(g) Length of prediction (nt). 

(h) Dot-bracket secondary structure mask of RNAalifold 
consensus. Ex: (((((...))))). 
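
The name and score fields described in step 6 can be decoded programmatically. The following sketch is an assumption-laden illustration: the example line is invented, the numeric subfields are simply read as floating-point numbers (their exact units are not specified above), and SISSIz with RIBOSUM scoring (code r) is assumed to use the same Z-score scaling as SISSIz. All function and class names are hypothetical.

```python
from typing import NamedTuple


class EcsStats(NamedTuple):
    """Colon-delimited statistics stored in the .bed name field, items (a)-(h)."""
    n_sequences: int
    raw_mpi: float
    norm_mpi: float
    gap_content: float
    mpi_sd: float
    shannon_entropy: float
    gc_content: float
    algorithm: str  # 's', 'r', or 'z'


def parse_name(name):
    """Split the name field into its eight documented subfields."""
    parts = name.split(":")
    if len(parts) != 8:
        raise ValueError("expected 8 colon-delimited subfields, got %d" % len(parts))
    return EcsStats(int(parts[0]), *map(float, parts[1:7]), parts[7])


def original_score(bed_score, algorithm):
    """Undo the integer scaling applied for display in the genome browser."""
    if algorithm in ("s", "r"):  # SISSIz Z-scores were multiplied by -100
        return -bed_score / 100
    if algorithm == "z":         # RNAz scores were multiplied by 100
        return bed_score / 100
    raise ValueError("unknown algorithm code: %r" % algorithm)


def parse_bed_line(line):
    """Parse the first six tab-delimited fields of one ECS .bed entry."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    stats = parse_name(name)
    return chrom, int(start), int(end), strand, stats, original_score(int(score), stats.algorithm)


# Invented example entry, for illustration only.
EXAMPLE = "chr11\t65190100\t65190300\t12:71.3:74.9:0.04:6.1:0.38:0.52:s\t254\t+"
print(parse_bed_line(EXAMPLE))
```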

3.5 The Next Frontier: Functional Parsing of lncRNAs

In higher eukaryotes, recurring RNA structural motifs that display
evidence of evolutionary conservation provide a tangible basis for
the functional annotation of noncoding sequences, as they may
indicate protein-interaction domains that potentially nucleate regulatory
networks. For example, Parker et al. [79] performed a
similar analysis using evolutionarily conserved RNA secondary
structures predicted with EvoFold [80] to generate profile Stochastic
Context-Free Grammars (SCFGs), which were then used to scan
the human genome for paralogs to the RNA structural predictions.
The results were grouped into RNA families based on their structural
similarities and revealed 220 families of RNA structures,
including 172 novel RNA structure families.

However, as effective as bioinformatic methods may be, they
seldom indicate what biological functions or processes are involved
(unless, of course, there is a high level of homology to well-characterized
RNAs). Assigning biological functions to novel RNA
structural motifs can be achieved via modern experimental techniques
predicated on high-throughput sequencing, such as RNA immunoprecipitation
(RIP-Seq), crosslinking immunoprecipitation (CLIP-Seq),
and chromatin isolation by RNA purification (ChIRP-Seq).
These methods can identify the RNAs interacting with specific proteins,
providing sets of RNA sequences that share the same protein-binding
characteristics. The increasing availability of next-generation
sequencing technologies will likely increase contributions to public
specialized databases such as starBase [81], which contains numerous
RNAseq data sets relating to RNA-protein interactions. Mining these
data with advanced bioinformatics tools will bridge the gap between
functional annotation of lncRNAs and RNA structure prediction.

Computational identification of RNA structures common to a
set of sequences can currently be performed via clustering algorithms
based on pairwise comparison scores, obtained through
either RNA structure alignment algorithms (e.g., CARNA [82],
LocARNA [83], FOLDALIGN [84, 85]) or other secondary structure
comparison strategies (e.g., GraphClust [86], RNACluster
[87], and NoFold [88]). These approaches have been applied to
small RNA sequences and have successfully identified both known
and yet to be characterized noncoding RNA families based on their
shared secondary structures [79, 83, 85, 87, 88]. Unfortunately,
lncRNA sequences are not directly amenable to such structure-motif
enrichment approaches because they may harbor extraneous
sequence elements, thus requiring additional processing such as the
extraction of subsequences presenting stable RNA structure
domains. Refining the aforementioned methods and applying
them to sequencing data that target RNA-protein interactions
will help identify new functional RNA structure motifs, which
can, in turn, serve to index genomic sequences. This strategy will
lay the foundations required to unravel the structure-function
relationships of lncRNAs, categorize their repertoires, and annotate
the expanses of noncoding sequences in vertebrate genomes.


4 Notes 


1. Sense or antisense? Given the complementary nature of canonical
RNA base pairs (G-C/C-G), it is not uncommon to find
that both strands of DNA produce high scoring, consensus
secondary structure predictions. When these bidirectional
structure predictions arise in regions with little or no associated
transcription, determining the most likely orientation of the
putative transcript can be quite difficult. Sequences with high
GC content are more susceptible to this phenomenon because
there are fewer G-U base pairs, which can effectively be used to
discriminate the host transcript’s orientation (the antisense
A-C base pair does not contribute to canonical Watson-Crick
base pairing). Occasionally, visual inspection of the alignments
and consensus RNA secondary structures can be sufficient to
identify the most likely orientation, i.e., the strand that produces
more base pairs (G-U in particular). Otherwise, the most
likely orientation can sometimes be determined by using the
RNAstrand program [89], a machine learning algorithm
which was specifically developed for this purpose (not covered
here). RNAstrand generates a score which estimates the orientation
of a consensus RNA secondary structure from a given
multiple sequence alignment used as input.

2. Genomic alignments and block sizes. As a strict minimum, the
blocks should be at least the length of the window size for
sampling structure conservation (by default, 200 nt). The longer
the blocks are, the more consecutive overlapping windows
will be sampled, which will provide greater genomic coverage
of the computational screen. Usually, alignments with more
species will present shorter blocks given the greater diversity of
synteny. In this case, ‘stitching’ the alignment blocks together
can also abrogate synteny in nonreference sequences (i.e., all
but the first row in the alignment), which may introduce
uncertainty in the consensus structure evaluation as noncontiguous
sequences are treated as contiguous. For example, a
500 nt segment from human chromosome 12 might align to a
250 nt segment from mouse chromosome 3 and 250 nt from
mouse chromosome 6; any windows sampled across the junction
between both mouse chromosomes will therefore not reflect
the biological reality (unless these regions are prone to fusion or
trans-splicing events, an unlikely predicament). From a practical
viewpoint, the multiple genome alignments produced by the
Enredo-Pecan-Ortheus pipeline [39, 90] (available via the
ENSEMBL comparative genomics portal:
http://ensembl.org/info/genome/compara/index.html) present much longer
syntenic blocks than those from TBA/Multiz [41] (accessible via
the UCSC Genome Browser comparative genomics tracks), thus
avoiding the need to ‘stitch’ several small alignments together.
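
The strand argument made in Note 1 can be verified on a toy hairpin. The sketch below is an illustration only; the sequences are invented and the helper names are hypothetical. It counts how many base pairs of a dot-bracket structure are canonical (Watson-Crick or G-U wobble) on the sense strand and on its reverse complement.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G-U wobble
COMPLEMENT = str.maketrans("ACGU", "UGCA")
SWAP_BRACKETS = str.maketrans("()", ")(")


def reverse_complement(seq):
    """Return the reverse complement of an RNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mirror(structure):
    """Return the dot-bracket structure as read on the opposite strand."""
    return structure[::-1].translate(SWAP_BRACKETS)


def base_pairs(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def canonical_pairs(seq, structure):
    """Count the base pairs of the structure that are canonical in seq."""
    return sum(seq[i] + seq[j] in CANONICAL for i, j in base_pairs(structure))


structure = "((((....))))"
with_gu = "GGGUAAAAGCUC"  # stem contains two G-U wobble pairs
gc_only = "GGGCAAAAGCCC"  # stem contains G-C pairs only

for seq in (with_gu, gc_only):
    sense = canonical_pairs(seq, structure)
    antisense = canonical_pairs(reverse_complement(seq), mirror(structure))
    print(seq, "sense:", sense, "antisense:", antisense)
```

In this toy case the G-U pairs of the sense strand become A-C mismatches on the antisense strand, whereas the GC-only stem is equally well paired in both orientations and therefore carries no strand information.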

Chapter 5


Construction of Functional Gene Networks 
Using Phylogenetic Profiles 

Junha Shin and Insuk Lee 


Abstract 

Functional constraints between genes display similar patterns of gain or loss during speciation. Similar 
phylogenetic profiles, therefore, can be an indication of a functional association between genes. The 
phylogenetic profiling method has been applied successfully to the reconstruction of gene pathways and 
the inference of unknown gene functions. This method requires only sequence data to generate phylogenetic
profiles. This method therefore has the potential to take advantage of the recent explosion in available
sequence data to reveal a significant number of functional associations between genes. Since the initial 
development of phylogenetic profiling, many modifications to improve this method have been proposed, 
including improvements in the measurement of profile similarity and the selection of reference species. 
Here, we describe the existing methods of phylogenetic profiling for the inference of functional associations 
and discuss their technical limitations and caveats. 

Key words Phylogenetic profiling, Functional association, Gene network 


1 Introduction 


The discovery of all the functional components of cells and the 
elucidation of all their interactions are the grand challenges in 
systems biology. Phylogenetic profiling [1-3] is a method in 
which interactions between genes are inferred using their similarity 
in inheritance across species. During speciation, genetic information
about functional components, such as proteins, is passed from
ancestral species to new species. Given that most, if not all, cellular
processes are operated by a set of genes, which are often represented
as a complex or a pathway, the functional interdependence
among genes is often coinherited. Functional constraints during 
speciation therefore would display a similar phylogenetic pattern 
across species, which provides the opportunity to infer functional 
association between genes. 

Large-scale gene networks, which can be constructed from the
functional associations inferred from various types of biological
data, including phylogenetic profiles, have proven useful to the
study of gene and pathway functions [3, 4]. The propagation of
function and phenotype information through the network facilitates
the identification of novel gene functions and/or loss-of-function
phenotypes [5]. Given that phylogenetic profiling methods
require only sequence data, these methods are likely to benefit
significantly from the recent advances in genome sequencing technology
such as next-generation sequencing. The rapid growth of
the number of genome-sequenced species potentiates the power of
phylogenetic profiling methods, because additional species can fill
the current gaps in knowledge about the evolutionary trajectories
of cellular functions. The expansion of genome projects therefore
can make a direct contribution to the understanding of gene and
pathway functions.

Since the initial development of phylogenetic profiling methods
[2, 3], various approaches have been explored to improve the
performance of these methods. These approaches differ mainly
with respect to the measurement of profile similarity and the selection
of reference species. In this chapter, we will describe conceptual
and technical differences among these methods as well as their
strengths and weaknesses. In addition, we will discuss the limitations
and caveats in the construction of gene networks using phylogenetic
profiling methods.


2 Materials


Proteins are the major biomolecule in cellular processes. Amino
acid sequences therefore may be more relevant to biological function
than nucleic acid sequences. Hence, we generally use protein
sequences to profile functional conservation across species. To look
for protein conservation between species, we generally use the
standard sequence homology search software, BLAST (Basic
Local Alignment Search Tool).

2.1 Protein Sequence Data

Protein sequence data for completely sequenced species are available
from major archive databases such as the National Center for
Biotechnology Information (NCBI, ftp://ftp.ncbi.nlm.nih.gov/genomes),
the European Bioinformatics Institute-European
Nucleotide Archive (EBI-ENA, ftp://ftp.ebi.ac.uk/pub), and the
Ensembl Genome Browser (ftp://ftp.ensembl.org/pub).
Although protein sequence data are generally provided in FASTA
format (see Note 1), the file extensions of FASTA protein sequence
files differ across data providers (e.g., .faa from NCBI and .pep.all.fa
from EBI). These sequence data originate from public data
repositories for genome projects that are maintained by either
genome sequencing centers or genome project consortiums
(Table 1). These repositories are replenishing sources for genome
project data that recently have been completed. Note that when we
use sequence data obtained directly from the original repository, we
need to confirm the data publication policy. Some genomes that
recently have been sequenced may be under data usage restrictions.
These repositories also may contain data from incomplete genome
projects. In these cases, genome sequence data are presented as
contigs or scaffolds, which may not provide a comprehensive list of
proteins for a given species. An incomplete list of proteins may
generate inaccurate phylogenetic profiles, which in turn may affect
the quality of inferred functional links.

Table 1
Public data repositories for genome sequences

Repository                                              URL

Maintained by genome sequencing centers
The Broad Institute (BI)                                http://www.broadinstitute.org/scientific-community/data
Department of Energy Joint Genome Institute (DOE-JGI)   ftp://ftp.jgi-psf.org/pub/JGI_data
The J. Craig Venter Institute (JCVI)                    ftp://ftp.jcvi.org/pub/data
Genolevures                                             http://www.genolevures.org/download.html
Genoscope                                               http://www.genoscope.cns.fr/spip/Genoscope-s-Resources.html
The Beijing Genomics Institute                          ftp://ftp.genomics.org.cn/pub
The Genome Database for Rosaceae (GDR)                  http://www.rosaceae.org/
VectorBase                                              https://www.vectorbase.org/downloads

Maintained by project consortiums
Consensus CDS Project (CCDS)                            ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/
Saccharomyces Genome Database (SGD)                     http://www.yeastgenome.org/download-data
Wormbase                                                ftp://ftp.wormbase.org/pub/wormbase/species/
FlyBase                                                 ftp://ftp.flybase.org/genomes/
The Arabidopsis Information Resource (TAIR)             ftp://ftp.arabidopsis.org/home/tair/Sequences/

2.2 Homology Search Software

Functional conservation between species can be represented by the
occurrence of homologous proteins. NCBI BLAST is the most
popular software by which to search for homologous proteins
across species. Installation files and source codes for the latest
version of BLAST are available from the ftp site
(ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).


3 Methods

3.1 Protein Homology Search Using the Stand-alone BLAST Software


FASTA input files should be carefully prepared to run an efficient
homology search in BLAST. Two input files are required to
execute the stand-alone BLAST program: a query-sequence file
and a reference-sequence file. The following steps should be taken
to run the homology search:

1. Create a query-sequence file that contains query protein 
sequences, which are usually from a target species, for gene 
network inference. 

2. Create a reference-sequence file of concatenated protein
sequences from the genomes of reference species. No specific
order for concatenated protein sequences is required, but
each protein sequence must have a unique identifier.
Assigning a specific “genome sequence code (GC)” is
recommended so that sequences can be recognized easily by
users as well as by the homology search program.
For example, “GC120-077-SEQ0009” stands for “the 9th protein
sequence encoded in the 77th reference species out of a
total of 120 reference species.” Once a GC is assigned to a
protein sequence in the reference-sequence file, the FASTA
metadata lines of the reference-sequence file need to be modified
by adding the GC after the “>” symbol, as follows (original
header first, modified header second):

>gi|158249333|ref|YP_0015144.1| response regulator
>gc120-077-seq0009 gi|158249333|ref|YP_0015144.1| response regulator

3. Install the stand-alone BLAST program. Installation methods 
are well described on the BLAST help webpage (http://www. 
ncbi.nlm.nih.gov/books/NBK52638/) for each operating 
system. 

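4. Format both the query-sequence file and the reference-sequence
file using the BLAST “formatdb” program by typing the following
command line:

/[path to blast]/bin/formatdb -i [input FASTA file] -p T -o F

Note that “formatdb” belongs to the legacy BLAST package; in
BLAST+ releases the equivalent program is “makeblastdb”
(makeblastdb -in [input FASTA file] -dbtype prot).
Factors that can cause errors during the formatdb procedure are
described in Note 2.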

5. Run “blastp,” a BLAST program for protein sequences. Commands
and switches for executing the stand-alone BLAST programs
are well described on the BLAST manual webpage
(http://www.ncbi.nlm.nih.gov/books/NBK1763/). Here is
an example that uses the default settings:

/[path to blast]/bin/blastp -db [reference-sequence file] -query [query-sequence file] -out [output filename]

The “blastp” program searches for sequence homology for only
one query sequence at a time against all the reference sequences. If
parallel searches for many query sequences are intended, then an
additional “loop” script for iterative executions is recommended.

6. Obtain the results of the “blastp” analysis, which include the
sequence IDs, homology scores, lengths of the matches, and the
actual matched positions for both the query and matched reference
sequences.

3.2 Construction of the Phylogenetic Profiles

A phylogenetic profile for a query protein is represented as a vector
of homology scores across reference species. Two alternative types
of homology scores may be used: binary scores for simple occurrences
of homology or quantitative measures of homology between
a query protein and a protein of the reference species. If the binary
score is used, then all the scores of the profile vectors are represented
as either a “1” or a “0,” which indicate the presence or
absence of a homologous protein in the reference species, respectively.
The assignment of binary scores based on the BLAST results
requires a threshold for the BLAST hit score (e.g., assign “1” if the
E-value < 1e-03). If the quantitative measure is used, then
BLAST hit scores are used as the profile score; this score is often
transformed for the purpose of mathematical procedures [6]. If
multiple homologous proteins exist in a reference species, only a
single BLAST hit score for the best homolog remains in the profile.
Both binary and quantitative score types have strengths and weaknesses
(see Note 3).

A phylogenetic profile matrix comprises the profiles from
multiple query proteins (Fig. 1). Note that the order of the reference
species (i.e., the order of the columns in the phylogenetic
profile matrix) does not affect the measurement of profile similarity,
because each reference species is assumed to be orthogonal. It has
been reported that the composition of reference species datasets
significantly impacts the performance of phylogenetic profiling
methods in retrieving functional associations between genes
[7-10] (see Note 4).

3.3 Measuring the Similarity Between Phylogenetic Profiles

The identification of functional associations between query
protein-coding genes can be accomplished by measuring the similarity
between phylogenetic profiles. Various similarity measures,
such as the Hamming distance [11, 12], the Jaccard coefficient [13,
14], Pearson’s correlation coefficient [13], and Mutual Information
(MI) [6, 15, 16], have been used in phylogenetic profiling
analyses. Different measures focus on different aspects of the profiles.
Consequently, similarity scores may differ significantly across
measures. Testing multiple measures and then choosing the measure
with the best performance for the given profiles is a practical
way to optimize the analysis.
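
The construction of binary profiles and the similarity measures listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the E-value cutoff of 1e-03 follows the example threshold given for binary scores, the natural logarithm is assumed for the entropies underlying MI, and the species names, E-values, and function names are all invented.

```python
import math
from collections import Counter

E_VALUE_CUTOFF = 1e-3  # example threshold for a binary "presence" call


def binary_profile(best_evalues, species):
    """Return 1/0 per reference species; species without a hit count as absent."""
    return [1 if best_evalues.get(sp, float("inf")) < E_VALUE_CUTOFF else 0
            for sp in species]


def hamming(a, b):
    """Number of reference species in which the two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard(a, b):
    """Shared presences divided by presences in either profile."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 0.0


def pearson(a, b):
    """Pearson's correlation coefficient (0.0 if a profile is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """MI from marginal and joint entropies of two discretized profiles."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


species = ["sp1", "sp2", "sp3", "sp4", "sp5", "sp6"]
# Best-hit E-values per reference species (invented numbers).
prot_a = binary_profile({"sp1": 1e-50, "sp2": 1e-20, "sp4": 1e-8, "sp5": 0.5}, species)
prot_b = binary_profile({"sp1": 1e-30, "sp2": 1e-10, "sp4": 1e-5}, species)
prot_c = binary_profile({"sp3": 1e-40, "sp5": 1e-12, "sp6": 1e-6}, species)

for name, other in (("A vs B", prot_b), ("A vs C", prot_c)):
    print(name, hamming(prot_a, other), jaccard(prot_a, other),
          pearson(prot_a, other), round(mutual_information(prot_a, other), 3))
```

In this toy example, proteins A and C have perfectly complementary profiles: the Hamming distance, Jaccard coefficient, and Pearson correlation all mark them as dissimilar, whereas MI scores them as highly as the identical pair A and B. This illustrates why similarity scores can differ substantially across measures.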

Fig. 1 A schematic summary of the phylogenetic profile matrix. The phylogenetic profile of a query protein (a
row) is a vector that consists of listed scores (gray-scaled rectangle), which indicate the occurrence of a
homolog within a reference species (columns). Two proteins with similar profiles (e.g., proteins A and B) are
expected to be functionally associated because of their coinheritance pattern

To illustrate how to measure profile similarity based on homology
scores in practice, we present a specific analysis that uses MI [6].
MI was developed originally for categorical values. The BLAST 
scores of profiles, however, are continuous values. The MI calculation 
therefore requires discretization of the BLAST scores (see Note 5). 
The MI can be calculated for the discretized BLAST scores using the 
following procedures: 

1. Calculate the “marginal entropy” and the “joint entropy.”

The marginal entropy of gene A, H(A), is calculated by

\[ H(A) = -\sum_{i=1}^{N} p(A_i) \log p(A_i), \]

where N is the number of the assigned bins and

\[ p(A_i) = \frac{\#\ \text{of profile scores that belong to bin } i \text{ for protein } A}{\text{total } \#\ \text{of profile scores for protein } A}. \]

The joint entropy between gene A and gene B, H(A,B), is
calculated by

\[ H(A,B) = -\sum_{i=1}^{N} \sum_{j=1}^{N} p(A_i,B_j) \log p(A_i,B_j), \]

where

\[ p(A_i,B_j) = \frac{\#\ \text{of profile scores that belong to bin } i \text{ for protein } A \text{ and bin } j \text{ for protein } B \text{ for the same reference species}}{\text{total } \#\ \text{of profile scores for proteins } A \text{ and } B}. \]

2. Calculate the MI of genes A and B, which is calculated by

\[ MI(A,B) = H(A) + H(B) - H(A,B). \]

3. A gene pair with a higher MI value is more likely to have a
functional association.

3.4 Benchmarking Functional Associations Inferred by Phylogenetic Profile Similarity

Functional gene networks are highly applicable to the study of
cellular systems (see Note 6). We can construct a network of functional
associations by phylogenetic profile similarity. An evaluation
of the inferred functional associations is critical to network construction.
To benchmark the inferred functional links, we use the
“gold standard” (GS) functional associations derived from functional
annotation databases, such as the Gene Ontology biological
process (GOBP) [17] and the Kyoto Encyclopedia of Genes and
Genomes (KEGG) [18], by pairing two genes that share any annotation
term. This process generates “GS positives.” We can also
generate “GS negatives” by pairing two genes that are annotated
but do not share any annotation terms.

One simple way to benchmark inferred links using the gold
standard set is to calculate the frequency of GS positives
among all the inferred links with functional annotations, which is
represented by the following equation:

\[ \text{Benchmark score} = \frac{\#\ \text{GS positives}}{(\#\ \text{GS positives}) + (\#\ \text{GS negatives})} \]

In practice, inferred gene pairs are ordered by decreasing similarity
scores. Benchmark scores then are calculated for each bin of 1000
gene pairs from the top scores.

Benchmarking is also useful for finding the optimal analysis
condition that achieves the maximal inference power. Many variables
and parameters can be selected during the analysis process.
These variables include: (1) the composition of the reference species
dataset, (2) the types of profile scores (e.g., binary or quantitative),
(3) the similarity measures between phylogenetic profiles, and
(4) other free parameters. The optimal analysis parameters are those
in which the best performance is observed.

3.5 Caveats and Limitations

Homologous proteins can be classified into two major classes:
orthologs and paralogs. Orthologs are homologous proteins passed
from ancestral species to their descendants; these proteins tend to
retain their function. In contrast, paralogs are homologous proteins
that have appeared as a consequence of gene duplications within a
species. A gene duplication event followed by a beneficial
modification is a major evolutionary mechanism that either creates a
new function (neofunctionalization) or diversifies an existing
function (subfunctionalization). Paralogs therefore tend to have
different functions. These functionally divergent paralogs have the
same phylogenetic profiles, however, which results in a high
similarity score. Paralogous pairs therefore need to be excluded from
inferred functional associations. A simple way to detect paralogs is
to “self-BLAST” using two identical query sequence files.
Alternatively, predefined paralogous relationships can be obtained
from databases such as the KEGG Sequence Similarity DataBase (SSDB,
http://www.kegg.jp/kegg/ssdb/).

The phylogenetic profiling method relies on homology information;
therefore, this method may not be applicable for proteins that lack
homology across reference species. This limitation can be overcome
by adding more sequenced species to the phylogenetic analysis. For
example, the phylogenetic profiling method has not been successful
for human proteins. This lack of success may be due to the fact that
most of the reference species used in previous analyses have been
unicellular microbes. It is expected that recently launched
large-scale genome project consortiums such as the
‘Genome 10K Project’ (https://genome10k.soe.ucsc.edu/) [19],
which proposes to sequence 10,000 vertebrates, will improve the
application of the phylogenetic profiling method to human
proteins in the future.


3.6 Case Study: Construction of a Yeast Gene Network Using Phylogenetic Profiles


Here we present an example of a yeast (Saccharomyces cerevisiae)
gene network to illustrate how to construct a gene network using
the phylogenetic profiling method (Fig. 2).

Fig. 2 Construction of the yeast gene network using phylogenetic profiles. (a) A heat-map view of a
phylogenetic profile matrix of yeast proteins. The rows represent yeast proteins and the columns represent
reference species. The homology score is indicated by the grayscale such that a stronger protein homology
(i.e., a lower BLAST E-value) is represented as a darker color. (b) Yeast gene pairs are listed in descending
order of the Mutual Information (MI) scores. (c) Benchmark curves for the inferred functional associations
between yeast genes from the MI calculation are constructed using different numbers of bins for the score
discretization. The percentage of gene pairs that share functional annotations (GS positives) among all
annotated gene pairs by the Gene Ontology biological process terms (y-axis) is measured for different
coverage of all the yeast coding genes (x-axis). The best network was inferred using 10 score bins for the
MI calculation. (d) A part of the inferred gene network was visualized by the Cytoscape program

1. Download the yeast protein sequences (i.e., query sequences)
from the Saccharomyces Genome Database repository (SGD)
[20] and download protein sequences for multiple reference
species from the major archive databases, such as NCBI, EBI,
and Ensembl. Add systematic genome codes (GC) to the FASTA
metadata lines and concatenate all protein sequences of the
reference species to create a single ‘reference-sequence’ file.

2. Format the input sequence files and execute the BLAST
program. Exclude insignificant hits (e.g., an E-value > 1) from the
BLAST results.

3. Construct a phylogenetic profile matrix of the BLAST E-values
(Fig. 2a). Include only the best hit score for each reference
species per query protein in the matrix. If there is no BLAST
hit of a query gene for a reference species, assign a score of “1”
for that reference species.

4. Calculate the MI scores between all the query protein pairs and
sort them in descending order (Fig. 2b).

5. Benchmark the inferred protein pairs with the gold standard
functional pairs. The optimal parameters (e.g., the number of
bins for the BLAST E-value discretization in the MI calculation)
can be determined from the best benchmark curve (Fig. 2c).

6. Define a gene network by applying a threshold to the benchmark
score. The resulting network is analyzed by various network
algorithms and visualized in Cytoscape [21] (Fig. 2d).
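
Steps 4 and 5 of the case study can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for Fig. 2: the profiles are assumed to be already discretized into bin labels (see Note 5), the natural logarithm is used, and the joint probabilities are estimated over the reference species. All function names, the toy profiles, and the gold standard sets are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(profile_a, profile_b):
    """MI(A, B) = H(A) + H(B) - H(A, B) for two discretized profiles.

    Each profile holds one bin label per reference species, in the same order.
    """
    joint = list(zip(profile_a, profile_b))  # (bin of A, bin of B) per species
    return entropy(profile_a) + entropy(profile_b) - entropy(joint)


def rank_pairs(profiles):
    """Score every gene pair and sort the pairs by decreasing MI."""
    scored = [(mutual_information(profiles[a], profiles[b]), a, b)
              for a, b in combinations(sorted(profiles), 2)]
    return sorted(scored, reverse=True)


def benchmark_scores(ranked, gs_positives, gs_negatives, bin_size=1000):
    """Benchmark score (#GS+ / (#GS+ + #GS-)) for each bin of ranked pairs."""
    scores = []
    for start in range(0, len(ranked), bin_size):
        pairs = [frozenset((a, b)) for _, a, b in ranked[start:start + bin_size]]
        pos = sum(p in gs_positives for p in pairs)
        neg = sum(p in gs_negatives for p in pairs)
        # Bins without any annotated pair have no defined score.
        scores.append(pos / (pos + neg) if pos + neg else None)
    return scores


# Hypothetical toy data: three genes, six reference species, bin labels 0-2.
profiles = {
    "geneA": [0, 0, 1, 1, 2, 2],
    "geneB": [0, 0, 1, 1, 2, 2],   # identical to geneA: MI equals H(A)
    "geneC": [0, 1, 0, 1, 0, 1],   # independent of geneA: MI is zero
}
ranked = rank_pairs(profiles)
gs_positives = {frozenset(("geneA", "geneB"))}
gs_negatives = {frozenset(("geneA", "geneC"))}
print(ranked[0])
print(benchmark_scores(ranked, gs_positives, gs_negatives, bin_size=3))
```

In a real analysis the bin size would be 1000 pairs as described in Subheading 3.4, and the benchmark curve would be recomputed for each candidate number of score bins.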
4 Notes


1. The first line of the FASTA format file starts with a “>” symbol
followed by metadata, such as the name of the protein, the name
of the origin organism, and a short description of the molecule.
The sequence information, which is written in single letters,
starts from the next line. The sequences are provided either as
a single line for the whole sequence or as multiple text lines of
~60-80 letters. Both styles can be used for BLAST input files.

2. One factor that causes frequent errors during the BLAST
“formatdb” process is the existence of the “*” symbol within or at
the end of a sequence. In some sequence data repositories, the
“*” is used either to mark an unidentified or suspicious amino
acid position or to designate the last position in the sequence.
The “formatdb” program cannot handle this symbol. If the
symbol is located at the last position of the sequence, then it
must be removed. If the symbol is located within a sequence,
then the entire sequence must be excluded from the analysis. In
addition, if the BLAST program runs in a UNIX-like
operating system such as Linux or Mac OS, then “^M”
(i.e., “control-M”), which is placed at the end of every text
line, also causes errors in the “formatdb” procedure. “^M”
represents a carriage return in the DOS system, and UNIX
systems do not recognize it. This conflict occurs when the
FASTA file is generated in a DOS system. Various ways exist to
remove the “^M” from the file, including the use of the shell
command “dos2unix.”

3. The simplicity of the binary score is advantageous for calculating 
the profile similarity measure due to the low computational 
burden. In contrast, quantitative scores provide high-resolution 
information, which potentially leads to a more accurate measure 
of similarity between profiles. 

4. There exists controversy about the proper composition of
reference species datasets. Several studies have reported that
phylogenetic profiles consisting of only prokaryotic genomes perform
well, and that the addition of eukaryotic genomes reduces
performance [7, 9]. In contrast, another study has reported that
eukaryotic genomes are improving the performance of
phylogenetic profiling methods as the number of completely sequenced
eukaryotic genomes grows [8]. Furthermore, the effect of an
increased number of reference species and the selection of
representative genomes from the reference species on the
performance of phylogenetic profiling methods have been investigated
thoroughly [10].

5. The BLAST E-value scores of the profiles range from “0” (i.e.,
perfect hit) to “1” (i.e., no valid hit). The discretization of
continuous scores requires the assignment of score bins. The
number of bins can be chosen arbitrarily. For example, we may
define score bins by dividing the entire score range into equal
intervals. The distribution of E-values, however, is skewed
toward 1. Equal intervals therefore will result in a heavily biased
binning of the scores. To resolve this problem, we fix a bin for all
scores of 1 and divide the remaining scores into bins that each
hold an equal number of scores. For example, suppose that the
profile matrix consists of ten BLAST E-values: {x | BLAST E-values
of the profile matrix} = {0, 0, 0.2, 0.4, 0.6, 0.8, 1, 1, 1, 1}. If
we create four score bins (i.e., categories) for discretization, the
score sets would be {1, 1, 1, 1}, {0, 0}, {0.2, 0.4}, and {0.6, 0.8}.

6. Network approaches have proven useful for biological studies 
[22]. For example, the novel functions of a gene can be inferred 
from network neighbor genes that are functionally annotated 
using the guilt-by-association principle [23]. A subnetwork 
structure, which is often called a module, is another useful 
network feature to investigate functional associations. If a 
group of genes is highly interconnected, representing a pathway, 
then this group of genes is likely to operate within the same 
cellular process. Other genes that are connected to this group 
may be new members of the cellular process. 
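
The binning scheme of Note 5 can be sketched as follows. This is only an illustration under our own assumptions: the function name is hypothetical, bin 0 is reserved for the “no valid hit” scores, and the remaining scores are split by rank into bins of (nearly) equal size.

```python
def discretize_evalues(evalues, n_bins, no_hit=1.0):
    """Assign each E-value to a bin index.

    Bin 0 collects all scores >= no_hit (no valid hit). The remaining scores
    are sorted and split into n_bins - 1 bins holding equal numbers of scores,
    numbered 1, 2, ... in order of increasing E-value.
    """
    if n_bins < 2:
        raise ValueError("need one bin for no-hit scores plus at least one more")
    # (value, original position) pairs for all valid hits, in ascending order
    hits = sorted((e, i) for i, e in enumerate(evalues) if e < no_hit)
    bins = [0] * len(evalues)
    n_rest = n_bins - 1
    for rank, (_, position) in enumerate(hits):
        bins[position] = 1 + rank * n_rest // len(hits)
    return bins


# The ten E-values used as the example in Note 5, with four score bins.
example = [0, 0, 0.2, 0.4, 0.6, 0.8, 1, 1, 1, 1]
print(discretize_evalues(example, n_bins=4))  # [1, 1, 2, 2, 3, 3, 0, 0, 0, 0]
```

Because the split is made by rank, identical E-values that straddle a bin boundary may end up in different bins; a production version might prefer to keep ties together.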
Chapter 6


Inferring Genome-Wide Interaction Networks 

Gokmen Altay and Onur Mendi 


Abstract 

The inference of gene regulatory networks is an important process that contributes to a better 
understanding of biological and biomedical problems. These networks aim to capture the causal molecular 
interactions of biological processes and provide valuable information about normal cell physiology. In this 
book chapter, we introduce GNI methods, namely C3NET, RN, ARACNE, CLR, and MRNET and 
describe their components and working mechanisms. We present a comparison of the performance of 
these algorithms using the results of our previously published studies. According to the study results, which 
were obtained from simulated as well as expression data sets, the inference algorithm C3NET provides 
consistently better results than the other widely used methods. 

Key words Gene network inference, Gene network inference (GNI) algorithms, Bioinformatics 


1 Introduction 


The inference of gene regulatory networks (GRN), which can be
seen as a reverse engineering problem, is a process of estimating
direct physical associations among genes from gene expression data
[1]. This process can provide valuable information about normal
cell physiology, development, and pathogenesis and contribute to a
better understanding of biological and biomedical problems [2-4].
Gene network inference (GNI) algorithms are widely used in
bioinformatics to detect the activator genes of genetic diseases, to
determine the functions of the regulating and regulated genes,
and to obtain drug targets [5]. GRNs aim to capture the
interactions between molecular entities and are represented as graphs in
which nodes represent genes, proteins or metabolites and edges
represent molecular interactions [6]. In vivo or in vitro, molecular
interactions can be detected accurately by classical molecular
biology approaches. Unfortunately, these methods are laborious and
the number of interactions that can be studied by these approaches
is limited [7]. Gene networks such as transcriptional regulatory
networks, protein networks, or metabolic networks represent
blueprints of dynamical processes within cells. Different types of
gene networks have different effects on the dynamical processes of
cellular systems [8, 9]. Hence, gene network inference has been
identified as a focal point in systems biology. However, GNI is a
challenging problem because of the very large scale of current
biological datasets and the noise caused by experimental and
computational processes.

The steps of the gene network inference process are shown in
Fig. 1. The dataset obtained from microarray data analysis consists
of gene expression levels. First, a gene expression matrix is
created from these preprocessed expression values. In this
matrix, each row corresponds to a gene whereas each column
corresponds to a sample. The second step is estimating interaction
scores of gene pairs. In this step, association score estimators such
as correlation-based, entropy-based and direct mutual information
(MI) estimators are used to obtain interaction scores. A dataset
discretization operation is required in order to use MI estimators.
At the end of the second step, a square gene association matrix is
obtained. Finally, GNI algorithms are applied to this association
matrix and the gene regulatory network inference process is
completed.

Fig. 1 The work flow of gene network inference

The most crucial process of GNI algorithms is to obtain the
interaction scores among cell molecules. The interaction scores
among gene pairs are determined from the gene expression datasets
by the association score estimators. However, there is no commonly
accepted estimator that is known to provide the best performance
for GNI methods. In the study [5], 27 different interaction
estimators were reviewed and the 14 most promising estimators were
evaluated. According to the study results, BS with spline order
2 (BS2), BS with spline order 3 (BS3), Kernel Density Estimator
(KDE), Pearson-based Gaussian (PBG), and Spearman-based
Gaussian (SPG) were found to be the best association score
estimators regarding performance and runtime (see Note 1).
Therefore, we preferred the Pearson-based Gaussian estimator in our
study [5, 10].
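
To make the estimator concrete, the sketch below computes a Gaussian MI value from the Pearson correlation r of two expression profiles using the closed form for a bivariate normal distribution, I = -0.5 ln(1 - r^2). We assume here that this standard formula is what “Pearson-based Gaussian” refers to; the function names and the example vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_mi(x, y):
    """MI of two variables under a bivariate Gaussian assumption."""
    r2 = pearson(x, y) ** 2
    r2 = min(r2, 1.0 - 1e-12)  # avoid log(0) for perfectly correlated profiles
    return -0.5 * math.log(1.0 - r2)


def association_matrix(expr):
    """Square MI matrix for an expression matrix (rows: genes, columns: samples)."""
    n = len(expr)
    mim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mim[i][j] = mim[j][i] = gaussian_mi(expr[i], expr[j])
    return mim


# Hypothetical expression values for three genes across four samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 3.0, 2.0, 4.0],    # r = 0.8 with the first gene
    [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the first gene
]
mim = association_matrix(expr)
print(round(mim[0][1], 4), round(mim[0][2], 4))
```

The estimator needs no discretization, which is one practical reason a correlation-based score is cheap to compute for all gene pairs.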

Several popular techniques have been developed to infer GRNs
from microarray gene expression data. The best of these methods
are based on information theory [11, 12]. The main principle of
information-based methods is estimating mutual information (MI)
values among gene pairs [13, 14]. MI-based methods are able to
detect linear and nonlinear effects among gene pairs [15, 16].
Furthermore, they enable us to work with large datasets, such
as 25,000 genes [17].

One of the first algorithms introduced was RN (relevance
network) [18]. This algorithm computes the mutual information
values for all pairs of genes and eliminates the edges among genes
whose MI values are not statistically significant. The second
well-known method is ARACNE [19]. ARACNE uses the data
processing inequality: in addition to the RN step, it performs a
second step to eliminate the least significant edge of a triplet of genes.
This results in a more conservative estimation of the inferred
network.

CLR (Context Likelihood of Relatedness) [20] is another
method that employs a background-sensitive estimator between
the gene pairs by converting MI estimates to values similar to
z-scores. In contrast to RN and ARACNE, CLR estimates individual
thresholds by considering an individual background for each pair
of genes. In addition to these methods, MRNET (maximum
relevance/minimum redundancy network) [21] infers a network using
the maximum relevance/minimum redundancy feature selection
method. Finally, C3NET (conservative causal core network
inference) [22, 23] has been introduced. The basic idea of C3NET is
selecting the edge for each gene with maximum mutual
information (MI) value (see Notes 2 and 3).

The book chapter is organized as follows. In the next section
we introduce GNI methods by describing their components and
working mechanisms. We start with a detailed review of C3NET
and then give brief descriptions of the other most widely known
inference algorithms, namely RN, ARACNE, CLR, and MRNET.
Then we present a comparison of the performance of inference
algorithms using the results of the study [22]. In the study,
performance was evaluated using simulated as well as expression data from
E. coli. Finally, we discuss the comparison results.


2 Methods 


The inference of gene networks from high-throughput data is an
important and very complex process. Recent advances in
biotechnology enable us to obtain large-scale expression data. The
availability of this type of data ushered in the development of gene
inference methods [1, 22]. In this section, we first demonstrate
the inference algorithm C3NET by describing its components and
working mechanism. We describe the implementation of the
algorithm and usage of its R package. Then, we briefly introduce the
other most widely known inference algorithms, namely RN,
ARACNE, CLR, and MRNET.


2.1 C3NET (Conservative Causal Core Network Inference)


The inference algorithm C3NET consists of two main steps. The
first step is the elimination of nonsignificant connections among
gene pairs, whereas the second step selects for each gene the edge
with maximum mutual information (MI) value [22]. The first step
is similar to previous methods, e.g., RN, ARACNE, or CLR. In this
step, C3NET tests the statistical significance of pairwise mutual
information values using resampling methods and eliminates
nonsignificant edges according to a chosen significance level α. The
mutual information [24] of two random variables X and Y is
defined as

$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$


    B: initiate adjacency matrix, B_ij = 0 for all i, j ∈ V
    C: initiate connectivity matrix, C_ij = 0 for all i, j ∈ V
    estimate mutual information I_ij for all i, j ∈ V
    repeat
        set C_ij = 1 if I_ij ≠ 0 is statistically significant (hypothesis test)
    until all pairs i ≠ j are tested
    for all i ∈ V do
        N_s(i) = {j : C_ij = 1 and j ≠ i}
        if N_s(i) ≠ ∅
            j_c(i) = argmax_{j ∈ N_s(i)} {I_ij}
        else
            j_c(i) = ∅
        endif
    end for
    for all i ∈ V do
        if j_c(i) ≠ ∅
            B_{i j_c(i)} = B_{j_c(i) i} = 1
        endif
    end for
    return adjacency matrix B

Fig. 3 The principal steps of C3NET. C3NET consists of two main steps. The first
step is for the elimination of nonsignificant connections among gene pairs,
whereas the second step selects for each gene the edge among the remaining
ones with maximum mutual information (MI) value [22]

In order to calculate a statistical threshold, C3NET uses resampling
methods that estimate the distribution under the null hypothesis
corresponding to a vanishing mutual information. For this
purpose, it randomizes the expression data set by permuting the gene
expression measurements n times and recalculating the pairwise
mutual information values for each permutation.
Then C3NET creates a vector combining these permuted mutual
information matrices and determines the threshold value according
to a chosen significance level α. Visualization of this vector is shown
in Fig. 2. The vertical (Y) axis represents the frequency of mutual
information values, whereas the horizontal (X) axis represents
mutual information values. The threshold, denoted by $I_0$, is
determined as the maximum mutual information value for the
significant region of the null distribution, as illustrated in Fig. 2
by the dashed line.
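
The resampling procedure used to obtain the threshold can be sketched as follows. This is a simplified illustration, not the c3net code: it assumes a Gaussian MI estimator, permutes each gene's values across samples independently, and takes the threshold as the (1 - α) quantile of the pooled null values, which is our reading of the description above. All names are hypothetical.

```python
import math
import random


def gaussian_mi(x, y):
    """Gaussian MI estimate, -0.5 * ln(1 - r^2), from the Pearson correlation r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no information
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - r2)


def null_threshold(expr, alpha=0.01, n_perm=5, seed=0):
    """Estimate the MI threshold from permuted data.

    expr: list of genes, each a list of expression values across samples.
    """
    rng = random.Random(seed)
    null_values = []
    for _ in range(n_perm):
        # Shuffling each gene separately destroys gene-gene dependencies.
        shuffled = [rng.sample(row, len(row)) for row in expr]
        for i in range(len(shuffled)):
            for j in range(i + 1, len(shuffled)):
                null_values.append(gaussian_mi(shuffled[i], shuffled[j]))
    null_values.sort()
    # Smallest null value that at least a fraction (1 - alpha) does not exceed.
    k = max(0, math.ceil((1 - alpha) * len(null_values)) - 1)
    return null_values[k]


# Hypothetical data: 4 genes measured in 20 samples.
data_rng = random.Random(1)
expr = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(4)]
i0 = null_threshold(expr, alpha=0.05, n_perm=5)
print(i0)
```

With few permutations the null distribution is coarse, so the estimated threshold can vary noticeably between random seeds.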

Figure 3 shows the principal steps of the C3NET algorithm.
First, C3NET creates a mutual information matrix (MIM) by
estimating the mutual information values from the data, using an
appropriate estimator that closely approximates the theoretical
value of the population. Starting from zero matrices C and B
(with $C_{ij} = 0$ and $B_{ij} = 0$ for all $i, j \in V$), C3NET tests
all pairwise mutual information values $I_{ij}$, $i, j \in V$, and sets
$C_{ij} = C_{ji} = 1$ if the null hypothesis $H_0: I_{ij} = 0$ can be
rejected for a given significance level α [22].

In the second step, the most significant connection for each
gene is selected. The algorithm first determines the neighborhood
$N_s(i)$ for all genes $i \in V$. The neighborhood of gene i is defined by
$N_s(i) = \{ j : C_{ij} = 1 \text{ and } j \neq i \}$. For this purpose, it uses
the connectivity matrix C. The link corresponding to the highest mutual
information value in the neighborhood of each gene is determined
by using $N_s$ and I. This link is identified by

$$j_c(i) = \arg\max_{j \in N_s(i)} \{ I_{ij} \}.$$

It is possible that all mutual information values $I_{ij}$ for $j \in V$ are
nonsignificant ($N_s(i) = \emptyset$). In this case, no index is assigned to $j_c(i)$.
The algorithm constructs the adjacency matrix B of the estimated
undirected network by setting $B_{i j_c(i)} = B_{j_c(i) i} = 1$ if $j_c(i)$ has been
set to a valid index. The rest of the entries of B remain zero [22].
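
The two steps can be sketched compactly in Python. In this illustration the hypothesis test of the first step is replaced by a plain comparison against a fixed threshold, which is a simplification on our part; the function name and the example MI matrix are hypothetical.

```python
def c3net_core(mi, threshold):
    """Return the adjacency matrix B for a symmetric MI matrix.

    Step 1: mark pairs whose MI exceeds the threshold in the connectivity matrix C.
    Step 2: for every gene keep only its significant edge with maximum MI.
    """
    n = len(mi)
    C = [[1 if i != j and mi[i][j] > threshold else 0 for j in range(n)]
         for i in range(n)]
    B = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbors = [j for j in range(n) if C[i][j]]  # N_s(i)
        if not neighbors:
            continue  # no significant partner: j_c(i) stays unassigned
        jc = max(neighbors, key=lambda j: mi[i][j])   # j_c(i)
        B[i][jc] = B[jc][i] = 1                       # undirected edge
    return B


# Hypothetical MI values for four genes (indices 0-3).
mi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.7, 0.6],
    [0.2, 0.7, 0.0, 0.3],
    [0.1, 0.6, 0.3, 0.0],
]
for row in c3net_core(mi, threshold=0.15):
    print(row)
```

In this example every gene selects the gene with index 1 as its strongest partner (and that gene selects index 0), so the result is a star-like network with three edges, even though five gene pairs pass the threshold.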

A visualization of the principal working mechanism of C3NET
is shown in Fig. 4. Suppose that we have the mutual information
values given by I. The mutual information values that are
statistically significant appear as “1” entries, whereas the remaining ones
appear as “0” entries in the corresponding connectivity matrix C.
Then the algorithm determines, for each gene, the statistically
significant connection to a neighboring gene with maximum mutual
information. This is the critical step in C3NET, resulting in
$j_c$ = (1, 2, 2, 2). The next step is determining the auxiliary
matrix $B_j$ directly from $j_c$. $B_j$ contains exactly the edges added
by each node. Because MI is symmetric in its arguments, it does not
provide directional information, so the resulting adjacency matrix B
is symmetric.

The resulting network represented by adjacency matrix B is a
star-like network where gene 2 is connected to three other genes. It is
obtained from the conversion of the asymmetric matrix $B_j$ to a
symmetric matrix B, as shown in the example of Fig. 4. It is
important to realize that each gene can add at most one connection, but
different genes i can select the same gene $j_c(i)$. For this reason, the
final undirected network can consist of genes having more than one
connection to other genes (see Note 4).

Fig. 4 Visualization of the principal working mechanism of C3NET. The edges shown in solid and dashed lines
correspond to significant edges. In the third step, the edges in solid lines correspond to the edges with
maximum mutual information value


2.1.1 Implementation of C3NET: Usage of the R Package

An R package called c3net is available from the website
https://r-forge.r-project.org/projects/c3net and is also downloadable through
the CRAN package repository. To illustrate the principal working
mechanism of C3NET, an example data set is provided in the R
package. This package includes both experiment and true network
data, which can be loaded in R by executing the data(expdata) and
data(truenet) commands. There is a core function available in the
c3net package that takes the data set as input and outputs the
inferred network. This function hides the individual steps of c3net and
provides an inferred network in an all-in-one single command. An
example usage of this function and its default parameters is as
follows [23]:

c3net(dataset, alpha = 0.01, methodstep1 = "MTC", MTCmethod = "BH", itnum = 5, network = TRUE)


The first parameter, dataset, is the data set where rows are
variables (e.g., genes) and columns are samples. The second parameter,
alpha, is a user-defined statistical significance threshold. The
methodstep1 parameter defines the procedure that will be used to
eliminate nonsignificant edges in Step 1 of C3NET. The options
that can be used for methodstep1 are “cutoff”, “MTC”, and “justp”.
If the cutoff or MTC option is selected, then the additional
parameter cutoffMI or MTCmethod must be set, respectively. If
methodstep1 = “cutoff”, then cutoffMI must be set to a numerical
value that is used as the cutoff value to eliminate edges with
nonsignificant MI values. The cutoffMI value can be set to 0 to use
the mean MI as the default cutoffMI.
If the MTC option is used as methodstep1, then MTCmethod needs
to be set to select the specific multiple testing correction method.
The six available MTC options are: (1) Benjamini and Hochberg
(“BH”), (2) Bonferroni (“bonferroni”), (3) Benjamini and Yekutieli
(“BY”), (4) Hochberg (“hochberg”), (5) Holm (“holm”), and (6)
Hommel (“hommel”). Additionally, the itnum parameter needs to be
set to specify the number of iterations used to obtain a null
distribution, and alpha the statistical significance level. If
methodstep1 = “justp”, then only alpha and itnum need to be set and
the elimination step of C3NET is done only with the p-values and
the significance level α [23].

Besides providing the inference procedure of C3NET [22], the
c3net package can also visualize the inferred network by using the
igraph package [25]. The visualization can be enabled by setting
the parameter network to TRUE.

net = c3net(expdata, network = TRUE)


Further, c3net can validate the performance of the inference by
its checknet function. The checknet function outputs the following
six values: precision, F-score, recall, TP, FP, and FN. The c3net
package provides additional functions that allow individual steps
to be performed separately instead of performing the whole inference
procedure. This flexibility allows users to combine internal functions of
c3net with components outside the package [23].

In order to demonstrate the checknet function, the example
command above was performed on the example data set located
in the software package. The “BH” multiple testing correction method
was used for the elimination of nonsignificant edges. The statistical
significance threshold and the number of iterations
were set to 0.01 and 5, respectively. Also, the network parameter
was enabled for the visualization of the inferred network. The
checknet results for the example data set are shown in Table 1.
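
The three ratios in Table 1 follow from the three counts. The sketch below assumes the standard definitions of precision, recall, and F-score (we have not checked how the package computes them internally); the function names are hypothetical, and only the counts TP = 263, FP = 123, and FN = 601 are taken from Table 1.

```python
def confusion_counts(inferred, truth):
    """Count TP, FP, FN over the upper triangle of two adjacency matrices."""
    tp = fp = fn = 0
    n = len(truth)
    for i in range(n):
        for j in range(i + 1, n):
            if inferred[i][j] and truth[i][j]:
                tp += 1
            elif inferred[i][j]:
                fp += 1
            elif truth[i][j]:
                fn += 1
    return tp, fp, fn


def network_scores(tp, fp, fn):
    """Precision, recall, and F-score from edge counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Counts reported in Table 1 for the example data set.
precision, recall, f_score = network_scores(tp=263, fp=123, fn=601)
print(f"{precision:.2f} {recall:.2f} {f_score:.2f}")  # 0.68 0.30 0.42
```

The high precision combined with low recall reflects the conservative design of C3NET: each gene contributes at most one edge, so many true edges are necessarily left out.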

Figure 5 is the topological representation of the inferred
network obtained by C3NET using the Fruchterman-Reingold
algorithm [26].

For ease of usage, the c3net package provides a file with the name
EXAMPLE.TXT containing examples. One can easily learn the
functionality of c3net by executing the examples line-by-line. For
additional help, c3net also provides an internal help function for
each command, which can be called from the command line.
Further, there is a manual file accessible from the inst/doc folder of
c3net which contains detailed explanations and examples of the
functions of c3net [23].


Table 1
Results of C3NET obtained by checknet function

Precision   F-score   Recall   TP    FP    FN
0.68        0.42      0.30     263   123   601


Fig. 5 Inferred network of example data set by C3NET. The Fruchterman-Reingold
algorithm is used in the topological representation of the inferred network


2.2 RN (Relevance Networks)


The approach of relevance networks [13] consists in inferring a
genetic network by computing the mutual information values for all
pairs of genes, and linking a pair of genes (ij) by an edge if their
corresponding mutual information value $I_{ij}$ is larger than a given
threshold $I_0$. In the resulting network, two genes are connected to each
other only if $I_{ij} > I_0$; otherwise no edge is included between i and j.
The threshold value $I_0$ is found by randomization of the gene
expression dataset.

The complexity of the algorithm is $O(n^2)$ since all pairwise
interactions are computed. Note that RN does not eliminate all
the indirect interactions between genes, since it can set an edge
between two genes that do not interact directly but are both
regulated by a third gene. For example, suppose that genes i and j
are regulated by gene k. This will result in high mutual information
between gene pairs (ij), (ik), and (jk). Therefore, the algorithm will
set an edge between i and j although these two genes interact only
through gene k [27].


2.3 ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks)

The algorithm for the reconstruction of accurate cellular networks
(ARACNE) [19] is an extension of the RN approach. The
algorithm starts with estimating the pairwise mutual information values
for all genes. Then it eliminates nonsignificant values according to
the obtained threshold $I_0$. This step is basically equivalent to
relevance networks, since it computes mutual information and declares
mutual information values significant if $I_{ij} > I_0$. If $I_{ij}$ is found to be
significant, then an edge is included in the corresponding adjacency
matrix between genes i and j, $A_{ij} = A_{ji} = 1$. In addition to the first
step, ARACNE performs a second step based on the data processing
inequality (DPI). The DPI is a relation between mutual
information values which loosely means that post-processing of data
cannot increase its information content [24]. DPI serves as a
filtering step. DPI states that, if gene X interacts with gene Z through
gene Y (X → Y → Z), then

$$I(X, Z) \le \min \{ I(X, Y), I(Y, Z) \}.$$

Here, the weakest edge of the gene triplet, I(X, Z), corresponds to
the indirect interaction and hence is eliminated by the DPI
approach. The working mechanism of DPI is shown in Fig. 6.

Fig. 6 Working mechanism of DPI. Although all six gene pairs have significant
mutual information values, the DPI will infer the most likely path of information
flow. For example, X-Z will be eliminated because I(X,Y) > I(X,Z) and
I(Y,Z) > I(X,Z). Y-T will be eliminated because I(Y,Z) > I(Y,T) and I(Z,T) > I(Y,T).
X-T will be eliminated in two ways: (1) because I(X,Y) > I(X,T) and I(Y,T) > I(X,T),
and (2) because I(X,Z) > I(X,T) and I(Z,T) > I(X,T) [19]

In this step, ARACNE tests all gene triplets (three genes with
mutual information values larger than $I_0$) and then, for each triplet
(ijk), it eliminates the edge corresponding to the lowest mutual
information value $I_1 = I_{ij}$, with $(ij) = \arg\min \{ I_{ij}, I_{jk}, I_{ik} \}$,
from the adjacency matrix if it is smaller than the second smallest
MI value $I_2$ multiplied by a factor $(1 - \epsilon)$ [19]:

$$A_{ij} = A_{ji} = \begin{cases} 0 & \text{if } I_1 < I_2 (1 - \epsilon) \\ 1 & \text{otherwise.} \end{cases}$$

Here $0 < \epsilon < 1$, and $\epsilon$ is the tolerance parameter. Simulation studies
that allow a comparison with the underlying true network are used
to obtain optimal values for $\epsilon$. For this reason, it can be said that $I_0$
is found in an unsupervised and $\epsilon$ in a supervised manner of
learning.

In ARACNE, each gene triplet is analyzed independently from
the other triplets. Hence, an edge is still taken into account when
other triplets are examined even if it has already been marked for
removal by prior DPI applications to different triplets. Consequently,
the order of examination of gene triplets does not affect the resulting
network. ARACNE has a complexity of $O(n^3)$ since the algorithm
considers all triplets of genes [19].
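
The DPI pruning step can be sketched in Python as follows. This is an illustrative reading of the description above rather than the reference implementation: edges are first selected with a fixed threshold, every fully connected triplet is examined on the thresholded network, and the marked edges are removed only at the end so that the order of the triplets does not matter. The function name and the MI values, chosen to mimic the four genes X, Y, Z, and T of Fig. 6, are hypothetical.

```python
from itertools import combinations


def aracne(mi, threshold, eps):
    """Threshold a symmetric MI matrix, then prune edges with the DPI."""
    n = len(mi)
    # Step 1 (as in RN): keep edges whose MI exceeds the threshold.
    A = [[1 if i != j and mi[i][j] > threshold else 0 for j in range(n)]
         for i in range(n)]
    # Step 2: in every fully connected triplet, mark the weakest edge.
    to_remove = set()
    for i, j, k in combinations(range(n), 3):
        if not (A[i][j] and A[j][k] and A[i][k]):
            continue
        edges = sorted([(mi[i][j], (i, j)), (mi[j][k], (j, k)), (mi[i][k], (i, k))])
        (i1, weakest), (i2, _) = edges[0], edges[1]
        if i1 < i2 * (1 - eps):
            to_remove.add(weakest)
    # Marked edges are removed only after all triplets have been examined.
    for a, b in to_remove:
        A[a][b] = A[b][a] = 0
    return A


# Genes X, Y, Z, T as indices 0-3; hypothetical MI values.
mi = [
    [0.0, 0.9, 0.5, 0.2],
    [0.9, 0.0, 0.8, 0.4],
    [0.5, 0.8, 0.0, 0.7],
    [0.2, 0.4, 0.7, 0.0],
]
network = aracne(mi, threshold=0.1, eps=0.1)
for row in network:
    print(row)
```

With these values all six pairs pass the threshold, but the DPI step removes X-Z, Y-T, and X-T, leaving the chain X-Y-Z-T, which matches the elimination pattern described for Fig. 6.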





2.3.1 Implementation of ARACNE: Usage of the R Package

The ARACNE algorithm is implemented in an R/Bioconductor
package called minet. MINET (Mutual Information NETworks) is
an open-source Bioconductor package that includes the network
inference methods RELNET, ARACNE, CLR, MRNET, and
MRNETB. It can be downloaded from the CRAN package repository
at http://cran.r-project.org as well as from the Bioconductor
website http://bioconductor.org [27].

Once the R platform is launched, the minet package can be activated
by using the "library(minet)" command. An example usage of the
ARACNE algorithm with the example data set in the C3NET package
is as follows:

    library(minet)                          # Load the minet package
    library(c3net)                          # Provides the example data set expdata
    data(expdata)                           # Load data
    mim <- build.mim(expdata,               # Build mutual information matrix
                     estimator = "pearson") #   using the pearson estimator
    net <- aracne(mim)                      # Infer the network by using the aracne algorithm
    netplot(net)                            # Visualize the inferred network

The topological representation of the inferred network obtained
by ARACNE, drawn using the Fruchterman-Reingold algorithm,
is shown in Fig. 7.

Fig. 7 Inferred network of the example data set by ARACNE. The Fruchterman-Reingold
algorithm is used in the topological representation of the inferred network


2.4 CLR (Context Likelihood of Relatedness)

The CLR algorithm is also an extension of the RN approach, which
starts by computing the pairwise mutual information values for all
genes. It then estimates the statistical likelihood of each mutual
information value $I_{ij}$ by comparing this MI value to a "background"
distribution of the MI values. In particular, two z-scores
are obtained for each gene pair $(ij)$ by comparing the MI value $I_{ij}$
with the gene-specific distributions $p_i$ and $p_j$. Here, $p_i$ and $p_j$
are the distributions of the MI values related to genes i and j,
respectively [20]. CLR takes into account the score

$$z_{ij} = \sqrt{z_i^2 + z_j^2}$$

by making a normality assumption about these distributions. Here,
$z_i$ and $z_j$ are the z-scores of $I_{ij}$, whereas $z_{ij}$ corresponds
to the joint likelihood measure. In contrast to RN and ARACNE, which
employ a global threshold $I_0$ for the MI values of all pairs of
genes, CLR estimates individual thresholds by considering an
individual background for each pair of genes. The complexity of
CLR is $O(n^2)$, since the mutual information matrix is computed once
for each gene pair [20].
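The CLR score can be illustrated with a short Python sketch. The function name `clr_scores` and the toy MI values are hypothetical. Two details are assumptions not stated in the text above: each gene's background is taken to be its off-diagonal row of the MI matrix, and negative z-scores are clipped to zero before they are combined.

```python
import math


def clr_scores(mi):
    """Combine gene-specific z-scores into z_ij = sqrt(z_i^2 + z_j^2).

    mi : symmetric matrix (list of lists) of pairwise MI values
    """
    n = len(mi)
    mean, std = [], []
    for i in range(n):
        # Background distribution of gene i: its MI values with all other genes.
        row = [mi[i][j] for j in range(n) if j != i]
        mu = sum(row) / len(row)
        mean.append(mu)
        std.append(math.sqrt(sum((v - mu) ** 2 for v in row) / len(row)))

    def z(i, j):
        # z-score of I_ij against the background of gene i (assumption:
        # negative values are clipped to zero, constant backgrounds give 0).
        if std[i] == 0.0:
            return 0.0
        return max(0.0, (mi[i][j] - mean[i]) / std[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            scores[i][j] = scores[j][i] = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
    return scores


# Toy example (made-up values): genes 0 and 1 stand out against their backgrounds.
mi = [[0.0, 0.9, 0.1],
      [0.9, 0.0, 0.1],
      [0.1, 0.1, 0.0]]
scores = clr_scores(mi)
```

An edge would then be kept when its score exceeds a chosen z-score threshold, so the effective MI cut-off differs from pair to pair.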

2.5 MRNET (Maximum Relevance, Minimum Redundancy)

MRNET is an iterative algorithm that infers a network using the
maximum relevance/minimum redundancy feature selection
method. The algorithm identifies potential interaction partners of
a target gene T that maximize a scoring function. The algorithm
starts by ranking the set of input variables V according to a score
that is the difference between the MI with the output variable T and
the average MI value with the previously ranked variables. The basic
idea is to rank direct interactions higher than indirect
interactions [21]. The working mechanism is shown below.

$$X_j = \arg\max_{X_j \in V \setminus S} (s_j)$$

$$s_j = I(X_j, T) - \frac{1}{|S|} \sum_{X_k \in S} I(X_j, X_k).$$

Here, the score $s_j$ is the difference between the mutual information
of $X_j$ with the target variable T (relevance term) and the average
redundancy of $X_j$ with each already selected variable $X_k \in S$
(redundancy term). A gene is added to the set S only if its score $s_j$
is above the threshold value $s_0$ and $X_j$ attains the maximum in the
first equation. The algorithm repeats the iteration procedure until no
further gene can be found that passes the threshold test. The MRNET
approach consists in finding interaction partners for T that are of
maximal relevance for T, but have a minimum redundancy with respect to
the interaction partners already found in the set S. The algorithm
starts with a fully connected, undirected network among all genes and
then eliminates the edges between T and the genes in $V \setminus S$,
which have not attained this maximum [21].

MRNET has a complexity of $O(f \times n^2)$, since the feature selection
step is repeated for each of the n genes. Therefore, it can be
said that the complexity of the algorithm ranges between $O(n^2)$ and
$O(n^3)$, according to the value of f [21].
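The selection loop for a single target gene can be sketched as follows. This is an illustrative sketch only: the function name `mrmr_partners`, the matrix layout, and the toy MI values are assumptions, and the redundancy term is taken to be zero while S is still empty.

```python
def mrmr_partners(mi, target, s0=0.0):
    """Greedy maximum relevance/minimum redundancy selection for one target.

    mi     : symmetric matrix (list of lists) of pairwise MI values
    target : index of the target gene T
    s0     : score threshold; selection stops when no score exceeds it
    """
    n = len(mi)
    candidates = [j for j in range(n) if j != target]  # V \ S
    selected = []                                      # S

    def score(j):
        relevance = mi[j][target]
        if not selected:
            return relevance  # assumption: no redundancy while S is empty
        redundancy = sum(mi[j][k] for k in selected) / len(selected)
        return relevance - redundancy

    while candidates:
        best = max(candidates, key=score)
        if score(best) <= s0:
            break  # no remaining gene passes the threshold test
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example (made-up values): gene 2 is relevant to the target (gene 0)
# but largely redundant with the already selected genes.
mi = [[0.0, 0.9, 0.8, 0.2],
      [0.9, 0.0, 0.7, 0.05],
      [0.8, 0.7, 0.0, 0.7],
      [0.2, 0.05, 0.7, 0.0]]
partners = mrmr_partners(mi, target=0, s0=0.14)
```

Repeating this selection with every gene in turn as the target yields the candidate edges of the network, which is where the factor n in the complexity estimate comes from.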


3 Comparison of Inference Methods 

The inference performance of GNI methods may vary according
to the data sets used in the assessment. Usually, a synthetic or a
few real biological datasets are used in the analysis, but this may
result in variations in the performance of the methods over different
datasets. In order to assess the performance of a GNI algorithm on
a de novo dataset, a framework called GANET has been developed.
GANET assesses the performance of GNI algorithms employing
the available literature of interaction databases. Any new real dataset
of any size can be assessed by using GANET [28].

In the study [22], the performance of C3NET is compared
with four of the most widely known inference algorithms: RN,
ARACNE, CLR, and MRNET. Simulated data as well as expression
data from microarray experiments were used in the analysis. The
simulations were performed by considering the ensemble approach
mentioned in Refs. [29, 30].

The performance of inference algorithms is assessed by using
the F-score. The F-score is obtained by using the formula
F = 2pr/(p + r), where p and r correspond to the precision
and recall. Here precision, p = TP/(TP + FP), and recall,
r = TP/(TP + FN), are functions of the numbers of true positive (TP),
false positive (FP), and false negative (FN) edges in an inferred network.
In the simulation study, two biological networks were used, which
represent sub-networks of the transcriptional regulatory network
(TRN) of E. coli [31, 32] and Yeast [33].

SynTReN was used to randomly sample sub-networks from
these TRNs. SynTReN is a network generator that produces synthetic
gene expression data for approximating the experimental
data [34]. Both networks consist of n = 100 genes. Synthetic
expression data, which mimic the mRNA concentration, were
generated by using the neighbor addition method of SynTReN. In
this process, nonlinear transfer functions based on Michaelis-Menten
and Hill enzyme kinetic equations were used [35-37].

In this section, we present the results obtained by using the
simulated ensemble data. Following that, we give the results for
expression data from E. coli.


3.1 Simulated (Synthetic) Data

The boxplots of the resulting F-scores for two different sample sizes
(p = {50, 200}) are shown in Fig. 8. The results were obtained by
using a sub-network of the Yeast GRN [33]. According to the results,
C3NET provides better results than all four other inference methods
considering the median value of the F-score as well as the other
statistical measures, e.g., minimum, maximum, or mean F-scores.

Table 2 provides a summary of the results obtained for the
sub-networks of Yeast and E. coli. These numerical results reveal that
C3NET gives the best result in all cases except one: the minimum
F-score value for Yeast with sample size 50. For this case, the score
of C3NET (0.2844) is quite close to the score of the best performing
algorithm, MRNET (0.2879).
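As a reminder of what these numbers measure, the F-score of an inferred network can be computed from its edge set and that of the true network. The following is a minimal sketch. The function name `f_score`, the representation of undirected edges as pairs of gene indices, and the example edge lists are assumptions made for illustration and are not taken from the study [22].

```python
def f_score(true_edges, inferred_edges):
    """F = 2pr/(p + r) for undirected edges given as pairs of gene indices."""
    # frozenset makes (1, 2) and (2, 1) the same undirected edge.
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}

    tp = len(true_set & inferred_set)   # true positives
    fp = len(inferred_set - true_set)   # false positives
    fn = len(true_set - inferred_set)   # false negatives

    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up example: 2 of 3 inferred edges are correct, 2 of 4 true edges are found.
true_net = [(1, 2), (2, 3), (3, 4), (4, 5)]
inferred_net = [(2, 1), (2, 3), (1, 5)]
score = f_score(true_net, inferred_net)
```

Here p = 2/3 and r = 1/2, so F = 4/7. Applying such a function to each data set of an ensemble gives the distribution of F-scores that the boxplots and Table 2 summarize.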

The analysis was repeated using a sub-network of E. coli [31,
32]. The ensemble size was 300, which results in 300 different data
sets, each consisting of 1000 samples. In data generation, the same
procedure was followed as for yeast. The boxplots of the resulting
F-scores for the three best performing algorithms, C3NET,
ARACNE, and MRNET, are shown in Fig. 9. These results also
indicate that C3NET provides the best results [22].

Fig. 8 Boxplots of F-scores for C3NET (C3N50, C3N200), ARACNE (AR50, AR200), MRNET (MR50, MR200), RN
(RN50, RN200), and CLR (CLR50, CLR200). Light gray color corresponds to sample size 50, whereas dark gray
color corresponds to sample size 200 for each method. A sub-network of Yeast GRN with ensemble size
N = 300 is used for the simulations [22]


Table 2
Summary of F-scores (max, min, mean and median) for C3NET, ARACNE and MRNET obtained from
our simulations

Dataset   Sample size   Statistical measure   C3NET    ARACNE   MRNET
Yeast     200           Max                   0.5478   0.4919   0.4927
                        Min                   0.336    0.2058   0.336
                        Median                0.4628   0.3836   0.4455
                        Mean                  0.4628   0.3795   0.4410
Yeast     50            Max                   0.4782   0.3983   0.4585
                        Min                   0.2844   0.1854   0.2879
                        Median                0.3859   0.3166   0.3698
                        Mean                  0.3848   0.3161   0.3683
E. coli   1000          Max                   0.6046   0.4973   0.5608
                        Min                   0.4131   0.1866   0.3512
                        Median                0.5308   0.3803   0.500
                        Mean                  0.5269   0.3758   0.4948

The sample size for Yeast is 200 and 50, whereas the sample size of E. coli is 1000 [22]

Fig. 9 Boxplots of F-scores for C3NET (white), ARACNE (gray) and MRNET (dark
gray). A sub-network of the E. coli TRN is used for the simulations. Sample size
is 1000 and ensemble size is N = 300 [22]

Fig. 10 Sub-network of yeast consisting of 100 genes, sample size is 200. Edge
colors are obtained from simulations of 300 data sets. The color of each edge
reflects its mean TPR. Specifically, for solid line black edges,
1 > TPR > 0.75, for dashed line black edges, 0.75 > TPR > 0.5, for
solid line gray edges, 0.5 > TPR > 0.25, and for dashed line gray edges,
0.25 > TPR > 0.0 [22]

Figure 10 shows the true sub-network of Yeast obtained by
using SynTReN. In the figure, gene names are shown on the labels
of the nodes. The type of each edge corresponds to the mean true
positive rate (TPR). The edge types for true positive rates are as
follows: for solid line black edges, 1 > TPR > 0.75, for dashed
line black edges, 0.75 > TPR > 0.5, for solid line gray edges,
0.5 > TPR > 0.25, and for dashed line gray edges,
0.25 > TPR > 0.0. It is apparent that all leaf edges inferred by
C3NET are correct, because the edges connecting to leaf nodes are
solid line black in both networks. A leaf node is a node
that has only one incoming edge and no outgoing edges. Here, the
incoming edge is called a leaf edge. This observation indicates the
efficiency of C3NET in inferring leaf edges. Additionally, the dashed
line gray edges help to observe that colliders cause difficulties for
the respective edges. Here, a collider is a node that has two
incoming edges [22].

The observations in the study [22] show that leaf nodes can be
inferred easily, whereas nodes that are not leaf nodes are more
difficult to infer. Hence, the detection of hubs is not easy. However,
because hubs are connected to many other nodes, it is possible that
one or more of these neighboring nodes is a leaf node.
Therefore, hubs are more likely to appear in the inferred network.

3.2 Expression Data from E. coli

The expression data, which consist of 524 microarrays, were
obtained from Ref. [20]. For this data set, it has been shown that
CLR, evaluated against a manually assembled reference
network, G 2067, performs better than ARACNE and RN. Hence, in this
section, the inference algorithm C3NET is compared
only with CLR. Table 3 provides a summary of the results obtained
for the comparison of C3NET and CLR [22].

The inferred network of E. coli is shown in Fig. 11. In the
figure, edges with solid lines correspond to TP edges, whereas
edges with dashed lines correspond to FP edges. The genes shown in
gray are regulated genes, whereas the genes shown in
black are regulating genes (transcription factors).

Table 3
Summary of results for C3NET and CLR obtained from our simulations

Algorithm   Interactions   TP    FP    FN     Precision
C3NET       99             74    25    3017   0.75
CLR         274            169   105   2922   0.62

A threshold value of 6.974 obtained for the z-scores used by CLR
A threshold value of 0.414 obtained as a result of significance test of the MI values for C3NET [22]

Fig. 11 Inferred E. coli network by C3NET. Black genes correspond to transcription
factors and gray genes to regulated genes. Edges with solid lines indicate true positive
results, whereas edges with dashed lines correspond to false positives [22]


3.3 Conclusions

In this book chapter, we examined the five most widely known
network inference methods (C3NET, RN, ARACNE, CLR, and
MRNET) and discussed their performance on various biological
and synthetic datasets and under various simulation conditions. The
performance of GNI algorithms is assessed using a global performance
metric. We also provide the implementation and usage of the R packages
of the algorithms C3NET and ARACNE.

Sample size is an important factor affecting the performance of
GNI algorithms. However, there is no commonly recommended
sample size that provides the best or optimal performance of the
GNI methods. The general opinion on this subject is that a larger
sample size results in better inference performance. According to
the study by Altay [38], the inference performance of the
information-theory-based GNI algorithms tends to converge after a
particular sample size region around ~64 (see Note 5). The results of the
study show that increasing the sample size over this region does not
improve the performance substantially [38].

The study by Altay and Emmert-Streib [22] reveals that the
conservative approach of C3NET, which allows each gene to add at
most one edge to the inferred network, provides consistently better
results in comparison with the other widely used methods [18-21].
According to the study results, C3NET gives a precision of 0.81 for
the expression data from E. coli. This result is 31% better than the
nearest precision, obtained by the CLR algorithm, which performs
better than ARACNE and RN (see Note 6). This robust result indicates
that the performance of a well-structured algorithm can be
better than that of other methods that are more complex [22]. In another
study, it was found that the other inference algorithms show a more
sensitive behavior in dependence on the network type used [23].


4 Notes 


1. Obtaining the interaction scores among cell molecules is the 
main process in almost all GNI algorithms. A failure in this 
step often leads to an erroneous result in the ultimate inference 
process. According to the study results [5], BS2, BS3, KDE, 
PBG, and SBG are observed as the best performing estimators. 
However, the runtime of the KDE is large. Therefore, we advise 
using BS, PBG, and SBG estimators for applications in which the 
runtime is more important than the precision result. 

2. The inference algorithms RN, ARACNE, CLR, and MRNET 
are becoming more and more complex. This may cause serious 
difficulties in obtaining a balanced statistical analysis [22]. 

3. All methods other than C3NET aim to infer the entire regulatory
network for a given data set. However, achieving this goal
is not easy for a large sample size. Observational data may not be
able to detect all dynamical interrelations that would allow a
reliable estimation. Hence, C3NET aims to infer only the strongest
interactions among covariates; this is called the conservative
causal core, or C3 [22].

4. An important characteristic of C3NET that distinguishes it from all
previous methods is that it can infer at most as many edges as there are
genes. The reason for this is that the second step of C3NET
allows each gene to add at most one edge to another gene [22].

5. The sample size region of ~64 is sufficient to achieve good
inference precision even for very large networks. However, the
F-score should be used to observe how close the inferred network
is to the whole of the true network [38].

6. The performance of the inference method depends crucially on
the characteristics of the data [22].


Integrating Heterogeneous Datasets for Cancer Module Identification

A.K.M. Azad 


Abstract 

The availability of multiple heterogeneous high-throughput datasets provides an enabling resource for
cancer systems biology. Types of data include gene expression (GE), copy number aberration (CNA),
miRNA expression, methylation, and protein-protein interactions (PPI). One important problem that can
potentially be solved using such data is to determine which of the possible pair-wise interactions among
genes contribute to a range of cancer-related events, from tumorigenesis to metastasis. It has been shown
by various studies that applying integrated knowledge from multi-omics datasets elucidates such complex
phenomena with higher statistical significance than using a single type of dataset individually. However,
computational methods for processing multiple data types simultaneously are needed. This chapter reviews
some of the computational methods that use integrated approaches to find cancer-related modules.

Key words Cancer modules, Cancer systems biology, Data integration, Gene-gene network, Multi-omics dataset


1 Introduction 


Cancer is a common genetic disease involving a range of factors.
Genomic, epigenomic, and differential gene expression aberrations
all play vital roles in a cancer's initiation, development, and
malignance [1]. It has been reported by various studies that
cancer-related activities including cell proliferation, angiogenesis, and
metastasis are associated with abrupt changes in regulatory and
signaling pathways [2-6]. Mutations involving somatic and copy
number aberrations of some genes can either directly affect some
key pathways or have a cumulative effect when they occur across
network modules representing common functional activities in
cancer [7, 8]. Consequently, identifying cancer modules is of
primary importance to the effective diagnosis and treatment of
cancer patients.

One of the core steps of cancer module identification involves
modeling gene-gene relationships in a network. Many algorithms
have been developed for this purpose, but most apply only to
homogeneous datasets, that is, data of only one type, usually GE
data or PPI information [9-15]. Most of the methods relying only
on GE data apply differential expression analysis, but it is often hard
to determine whether such variations in expression are causative or
merely an effect of complex diseases [16]. Differential expression
analysis can produce false negatives and false positives: some
important genes in cancer-related pathways may not be identified as
differentially expressed, whereas some differentially expressed
genes may not be relevant to cancer [17]. Typically, CNA regions
identified by some approaches [18-20] using only CNA datasets
are spatially extensive, which makes it difficult to identify a specific
gene causing genomic aberration [21]. PPI can provide important
information in characterizing topological properties of the network
involving cancer genes [7]. However, PPI information for multiple
cell types and developmental stages is still incomplete, which limits
its usefulness in developing methods for cancer module
identification.

Recent studies have demonstrated the "genomic footprint" of
driver mutations on gene expression [21-23]. This happens when
somatic mutations and copy number aberrations affect a gene's
transcriptional changes directly or indirectly [24] and thus perturb
some core pathways relevant to cancer growth and malignance [1].
Research carried out for The Cancer Genome Atlas on both
glioblastoma [25] and ovarian carcinoma [26] demonstrated the
simultaneous occurrences of mutations, copy number aberrations, and
gene expression changes in a significant number of patients in the
core components of some key pathways (see Note 1). In this chapter
we discuss some methods that find cancer-related modules by
integrating multiple heterogeneous datasets.

This chapter is organized as follows. We first briefly introduce
some of the main sources of data that can be used and the required
preprocessing steps essential for subsequent integrated analysis.
Then, we describe methods that integrate information from
heterogeneous data sources to find cancer-related modules/sub-networks
(see Note 1). Finally, we address some approaches for
validating identified modules.


2 Data Sources 


Gene expression data from cancer samples can be primarily found
in the database GEO (Gene Expression Omnibus) [27]. It is a
database of gene expression values measured using high-throughput
hybridization arrays (also known as chips or microarrays).
Sample values are deposited in both raw and normalized
versions. Another comprehensive collection of gene expression
data from various cancer samples is The Cancer Genome Atlas
(TCGA) [28]. There are three different levels of datasets available
in TCGA: Level 1 consists of low-level (not normalized) data for a
single sample probe, Level 2 consists of normalized single sample
probe data, and Level 3 consists of aggregated gene-level data
(grouped by mapped probes with gene symbols). Mutation,
copy number aberration, DNA methylation, and miRNA
expression datasets can also be found in the TCGA data portal.

Preprocessing is an important step in data integration,
especially when paired samples are used (see Note 2). Preprocessing
of GE values includes scale transformation, imputing missing
values, handling redundancies, pattern standardization (i.e.,
normalizing to a zero mean and unit standard deviation), and other
transformations [29]. Preprocessing of CNA data in microarray
chips is typically more complex than that of GE data, and can
include quantile normalization, imputing missing values,
summarizing multiple probes at a single locus (with mean or median),
segmentation of genomic regions, and mapping segmented CNA
values in genomic regions into corresponding gene symbols
[17, 30]. Probe level methylation data from CpG sites can be
normalized between 0 and 1 by finding the following ratio [31]:

$$\beta_i = \frac{\max(M_i, 0)}{\max(M_i, 0) + \max(U_i, 0) + \alpha}, \qquad (1)$$

where $\beta_i$ is the Beta-value for the ith interrogated CpG site, and
$M_i$ and $U_i$ are the intensities measured by the ith methylated and
unmethylated probes. After background adjustment, the intensities
($M_i$ and $U_i$) may become negative, but in the above definition
those negative values are reset to 0. In addition, a constant offset
$\alpha$ (default value = 100) is added to the denominator to
regularize the Beta-value when both the $M_i$ and $U_i$ intensities
are very low, as suggested by Illumina [31].
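The Beta-value of Eq. (1) is simple enough to sketch directly. The function name `beta_value` and the example intensities are hypothetical; only the formula and the default offset of 100 come from the text above.

```python
def beta_value(m, u, alpha=100.0):
    """Beta-value of one CpG site from methylated (m) and unmethylated (u) intensities."""
    m_pos = max(m, 0.0)  # negative intensities are reset to 0
    u_pos = max(u, 0.0)
    return m_pos / (m_pos + u_pos + alpha)


# Made-up intensities: a strongly methylated site, a site with a negative
# methylated intensity after background adjustment, and a low-intensity site.
high = beta_value(900.0, 0.0)
clipped = beta_value(-50.0, 400.0)
low = beta_value(10.0, 10.0)
```

The last case shows the effect of the offset: with equal but very low intensities the value is pulled toward 0 (10/120) rather than being reported as 0.5.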


3 Methods for Integrating Heterogeneous Datasets 

Figure 1 generalizes a possible approach that integrates multiple
heterogeneous datasets in order to find cancer-related modules in a
gene-gene network. The gene-gene network can be modeled
either by exploiting combined knowledge from multiple datasets
or by merging individual networks built upon corresponding
datasets. In these networks, nodes represent genes and the edges can be
modeled as the relationships (i.e., directed and/or undirected)
among them. PPI information can be useful at various stages of
network modeling. After modeling the integrated network, various
module detection techniques such as optimization models,
hierarchical clustering, etc. can be applied to find cancer-related modules.
The following sections describe some of the methods that use
integrated approaches for cancer module identification.

Fig. 1 Schematic diagram of a possible integrated approach for cancer module identification. Each input
dataset contains both cancer and normal samples. In network modeling, genes are identified based on
differential information in the two-conditional studies (cancer vs normal), and edges can be defined according
to pair-wise correlation

3.1 iMCMC

A method known as iMCMC (identify Mutated Core Module in
Cancer) [32] was developed for the simultaneous analysis of three
heterogeneous datasets: gene expression (GE), copy number
aberration (CNA), and sequence mutations. These are combined to
infer a network in which core cancer modules are identified
(see Note 3). The method involves an optimization model followed
by statistical significance tests. The method starts by
building two different networks, one generated from GE data and
the other by combining somatic mutations with CNAs over common
samples. These two networks are then combined to construct
an integrated network.

First, a binary matrix $A_0$ is constructed in which the columns
represent the paired samples containing somatic mutations and
CNAs, and the rows represent genes that the samples have in
common. Each entry in $A_0$ is set to 1 if a mutation occurs in the
corresponding gene and sample, or if there is a statistically
significant copy number variation detected; otherwise the entry is set
to 0. Genes that are mutated in the same samples in $A_0$ are
combined into larger metagenes, and thus a new matrix A, called the
mutation matrix, is obtained. Another data matrix, B, is built from
the expression values. Its entries are real values representing the
relative expression of a given gene in a particular sample. The
following sections explain the methodologies for constructing
the Expression Network (EN) and Mutation Network (MN)
from the data matrices B and A, respectively.

3.1.1 Constructing the Expression Network

The Expression Network is based on the gene expression dataset. In
this network, both nodes and edges are weighted. Nodes represent
genes and their corresponding weights reflect the extent to which a
mutation in that gene affects the expression levels of other genes.
Each edge weight is defined as the absolute correlation between the
expression levels of the two corresponding genes.

The definition of nodes in the EN depends on both data
matrices A and B. New sets of genes and samples are defined as
$G = G_A \cap G_B$ and $S = S_A \cap S_B$, where $(G_A, S_A)$ and
$(G_B, S_B)$ are the sets of genes and samples in the two data matrices
A and B, respectively. For each gene $g_i \in G$, the corresponding
samples in S are classified into two groups, based on that gene's
mutation status in A. The numbers of samples in each group are denoted
$n_i^{(1)}$ and $n_i^{(0)}$. Then, for each $g_i$, a mutation-correlated
expression vector $e_i = (e_i^{(1)}, e_i^{(0)})$ is constructed, where
$e_i^{(1)}$ and $e_i^{(0)}$ are defined as follows:

$$e_i^{(1)} = \{b_{ki} : a_{ki} = 1,\ k \in S\}, \qquad e_i^{(0)} = \{b_{ki} : a_{ki} = 0,\ k \in S\}. \qquad (2)$$

Here $a_{ki}$ and $b_{ki}$ denote the entries for the i-th gene and k-th
sample in the data matrices A and B, respectively. To determine whether
there are significant differences between the expression levels in
$e_i^{(1)}$ and $e_i^{(0)}$, p-values are calculated using mattest in
MATLAB. A small p-value indicates that mutations in the gene in question
affect the expression levels of other genes. Since there should be a
minimum of two samples in each group for conducting this test, the set of
nodes G in the EN is defined as follows:

$$G = \{g_i : n_i^{(1)} \ge 2,\ n_i^{(0)} \ge 2\}. \qquad (3)$$

The weight $f_i$ of each node in EN is then defined by Eq. (4):

[Eq. (4): the expression for the node weight $f_i$, which holds for all $g_i \in G$ and involves a sum over $r = 1, \dots, d$, is not legible in the source.]

where d is the total number of genes in $G_B$ and $p_r$ is the p-value
calculated for gene $g_r$ as described above. The weight $u_{ij}$ of any
edge in G is defined as the absolute Pearson correlation between
the two mutation-correlated expression vectors $e_i$ and $e_j$ among the
samples in S. In the case of metagenes, node and edge weights are
defined as the averages of the corresponding values of their
constituent genes.
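The building blocks of the Expression Network can be sketched as follows. All function names and example values are hypothetical. Each gene is assumed to be given as one row of A (0/1 mutation status) and one row of B (expression) over the same ordered samples in S. The p-value computation is omitted because it relies on an external test, and the correlation is computed directly on two expression profiles, which is a simplification of the edge weight described above.

```python
import math


def split_by_mutation(a_row, b_row):
    """Split a gene's expression values into mutated (1) and non-mutated (0) groups."""
    e1 = [b for a, b in zip(a_row, b_row) if a == 1]
    e0 = [b for a, b in zip(a_row, b_row) if a == 0]
    return e1, e0


def is_en_node(a_row, min_group=2):
    """A gene is kept as an EN node only if both groups have at least two samples."""
    n1 = sum(1 for a in a_row if a == 1)
    n0 = sum(1 for a in a_row if a == 0)
    return n1 >= min_group and n0 >= min_group


def abs_pearson(x, y):
    """Absolute Pearson correlation between two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # assumption: a constant profile gets weight 0
    return abs(sxy / math.sqrt(sxx * syy))


# Made-up example: one gene observed in four samples.
mutation_status = [1, 0, 1, 0]
expression = [5.0, 1.0, 6.0, 2.0]
e1, e0 = split_by_mutation(mutation_status, expression)
```

The two groups `e1` and `e0` are the inputs to the two-sample test that yields the p-value for the gene, and `abs_pearson` gives an edge weight between 0 and 1.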





3.1.2 Constructing the Mutation Network

To build the Mutation Network (MN) from the mutation matrix A,
the same gene set G is used as for the network EN. The weight of
any node (or gene) $g_i \in G$ is defined as follows:

$$h_i = \frac{m_i}{m}, \qquad (5)$$

where m is the total number of samples in A and $m_i$ is the total
number of mutations occurring in the samples of A for a particular
gene $g_i$. The weight $v_{ij}$ of any edge between genes $g_i$ and $g_j$
in MN is defined as the ratio of the number of samples in which exactly
one of the two genes is mutated to the number of samples in which at least
one of the two genes is mutated in A.
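Both Mutation Network weights follow directly from the rows of the mutation matrix A. The sketch below uses hypothetical function names and made-up 0/1 rows, and returns an edge weight of 0 when neither gene is mutated in any sample, which is an assumption for the undefined case.

```python
def mn_node_weight(row):
    """h_i = m_i / m: fraction of samples in which the gene is mutated."""
    return sum(row) / len(row)


def mn_edge_weight(row_i, row_j):
    """v_ij: samples with exactly one of the two genes mutated, divided by
    samples with at least one of the two genes mutated."""
    exactly_one = sum(1 for a, b in zip(row_i, row_j) if a != b)
    at_least_one = sum(1 for a, b in zip(row_i, row_j) if a or b)
    return exactly_one / at_least_one if at_least_one else 0.0


# Made-up example: two genes observed in five samples.
gene_i = [1, 1, 0, 0, 0]
gene_j = [0, 1, 1, 0, 0]
h_i = mn_node_weight(gene_i)
v_ij = mn_edge_weight(gene_i, gene_j)
```

In this example the genes are mutated together in one sample and separately in two, so the edge weight is 2/3. The weight reaches 1 when the two genes are never mutated in the same sample.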


3.1.3 The Integrative Network

An integrative network M is constructed by combining the
expression network EN with the mutation network MN. It is
necessary to first adjust the weights of nodes and edges in EN
and MN so that they become comparable. Two balancing terms,
$\xi$ and $\eta$, are defined for the networks EN and MN, respectively, as
follows:

$$\xi = \frac{u}{f}, \qquad \eta = \frac{v}{h}, \qquad (6)$$

where $f = \max(f_i)$ and $u = \max(u_{ij})$ in EN, and $h = \max(h_i)$ and
$v = \max(v_{ij})$ in MN. Now, if $F = \{f_i\}$ and $U = \{u_{ij}\}$, then the
edge weights U and node weights $\xi F$ are said to have balanced
values in EN. Similarly, if $H = \{h_i\}$ and $V = \{v_{ij}\}$, then the edge
weights V and node weights $\eta H$ have balanced values in MN.
A relative importance term can also be introduced to modify the
relative impact of the two networks EN and MN on the integrated
network. Let k denote the relative importance of MN relative to
EN and set $\delta \cdot (u/v) = k$, so $\delta = k \cdot (v/u)$. In the
remainder of this description, we set k = 1. Thus, node weights $c_i$ and
edge weights $w_{ij}$ can be defined as follows:

$$w_{ij} = \delta \cdot u_{ij} + v_{ij}, \qquad c_i = \delta \cdot \xi \cdot f_i + \eta \cdot h_i, \qquad (7)$$

where i, j = 1, ..., n. Here, n is the total number of genes in G.
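The weight balancing can be made concrete with a small sketch. It assumes the following reading of the definitions in this section: $\xi = u/f$, $\eta = v/h$, $\delta = k \cdot v/u$, $w_{ij} = \delta u_{ij} + v_{ij}$ and $c_i = \delta \xi f_i + \eta h_i$. The function name `integrate`, the dictionary representation of edge weights, and the example numbers are hypothetical.

```python
def integrate(f, h, u_edges, v_edges, k=1.0):
    """Combine EN and MN weights into node weights c and edge weights w.

    f, h             : node weights in EN and MN (lists indexed by gene)
    u_edges, v_edges : edge weights in EN and MN (dicts keyed by (i, j), i < j)
    k                : relative importance term
    """
    f_max, h_max = max(f), max(h)
    u_max, v_max = max(u_edges.values()), max(v_edges.values())

    xi = u_max / f_max         # balances node and edge weights within EN
    eta = v_max / h_max        # balances node and edge weights within MN
    delta = k * v_max / u_max  # balances EN against MN (assumed form)

    c = [delta * xi * fi + eta * hi for fi, hi in zip(f, h)]
    edges = set(u_edges) | set(v_edges)
    w = {e: delta * u_edges.get(e, 0.0) + v_edges.get(e, 0.0) for e in edges}
    return c, w


# Made-up example with two genes and one edge.
c, w = integrate(f=[0.5, 0.25], h=[0.2, 0.4],
                 u_edges={(0, 1): 0.8}, v_edges={(0, 1): 0.4})
```

With k = 1, the largest rescaled EN weights and the largest MN weights all equal v, so neither network dominates the integrated weights merely because of its scale.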


3.1.4 An Optimization Model for Identifying Core Cancer Pathways

The final step of this approach is to identify some core modules in
the integrative network M, where each such module contains genes
with both high node weights and high edge weights. For this
purpose, an optimization model (previously reported by Wang
et al. [33]) is employed. The optimization problem is stated as
follows:

$$\max_{x} \ \sum_i \sum_j w_{ij} x_i x_j + \lambda \sum_i c_i x_i$$

$$\text{s.t.} \quad x_1^{\beta} + x_2^{\beta} + \cdots + x_n^{\beta} = 1, \qquad (8)$$

$$x_i \ge 0, \quad i = 1, \dots, n,$$

where the non-negative vector $x = (x_1, x_2, \dots, x_n)$ contains the
degrees of each node in a particular module (sub-network). The
first term in the objective function states the inter-connectivity
within the module, whereas the second term specifies the degree
of association between the nodes and the module. The role of the
positive parameter $\lambda$ here is to balance these two terms (see Note 4).
In this model, the regularization constraint over the variable
$x = (x_1, x_2, \dots, x_n)$ controls the number of nodes to be selected
in the module and the parameter $\beta$ adjusts the strength of this
regularization. Here we set $\beta = 1$ to find small-sized core modules.

The following iterative algorithm [33] provides an easy solution
of the above optimization model by finding a local maximum
in the vicinity of a predetermined initial approximate solution:

$$x_i^{(t+1)} = \left( x_i^{(t)} \, \frac{2 (W x^{(t)})_i + \lambda c_i}{2\, x^{(t)T} W x^{(t)} + \lambda \sum_i c_i x_i^{(t)}} \right)^{1/\beta}, \qquad (9)$$

where $W = \{w_{ij}\}$ is the n x n edge weight matrix, and
$x^{(t)} = (x_1^{(t)}, x_2^{(t)}, \dots, x_n^{(t)})$ is the solution vector at
the t-th iteration.

The non-zero entries in the solution vector x define a particular
module (sub-network), where in practice the entries are defined as zero if
they are less than 0.1. Once a locally optimal solution is obtained, the
corresponding nodes are removed from the network and the whole
process is repeated to find additional modules.
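The iterative scheme can be sketched in a few lines. The sketch assumes the multiplicative update $x_i \leftarrow (x_i \cdot (2(Wx)_i + \lambda c_i) / (2 x^T W x + \lambda \sum_i c_i x_i))^{1/\beta}$, which keeps $\sum_i x_i^{\beta} = 1$ at every step. The function name `find_module`, the uniform starting point, the fixed iteration count, and the toy network are illustrative assumptions.

```python
def find_module(w, c, lam=1.0, beta=1.0, iters=200, cutoff=0.1):
    """Iterate the multiplicative update and return (module, x).

    w      : symmetric n x n edge weight matrix (list of lists)
    c      : node weights (list of length n)
    lam    : balance between edge term and node term
    beta   : exponent of the regularization constraint
    cutoff : entries of x below this value are treated as zero
    """
    n = len(c)
    # Assumed starting point: uniform vector satisfying sum(x_i ** beta) = 1.
    x = [(1.0 / n) ** (1.0 / beta)] * n
    for _ in range(iters):
        wx = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = (2.0 * sum(x[i] * wx[i] for i in range(n))
                 + lam * sum(c[i] * x[i] for i in range(n)))
        if denom == 0.0:
            break
        x = [(x[i] * (2.0 * wx[i] + lam * c[i]) / denom) ** (1.0 / beta)
             for i in range(n)]
    module = [i for i in range(n) if x[i] >= cutoff]
    return module, x


# Toy network (made-up weights): genes 0 and 1 are strongly connected and
# have high node weights; genes 2 and 3 are only weakly attached.
w = [[0.0, 1.0, 0.1, 0.1],
     [1.0, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 0.1],
     [0.1, 0.1, 0.1, 0.0]]
c = [0.5, 0.5, 0.1, 0.1]
module, x = find_module(w, c)
```

To find further modules as described above, the rows and columns of the selected genes would be removed from `w` and `c` and the function called again.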

3.2 Wen et al. 

The method of Wen et al. integrates DNA methylation, gene 
expression, and protein-protein interaction datasets to identify 
causal network modules in colorectal cancer [34]. The method 
starts with collecting a set of candidate causal genes. This collection 
is the union of a set of differentially methylated genes and a 
common subset of known cancer genes from DNA methylation chips, 
the Cancer Gene Census (CGC) [35], and tumor-associated genes 
in the TAG database. Employing a minimum multi-set cover 
strategy due to Kim et al. [36], a gene is determined to be differentially 
methylated if its comparative $\beta$ value (a measurement of DNA 
methylation level) between tumor and paired non-tumor samples 
is > 0.2 [37, 38]. 

Next, a comprehensive protein-protein interaction (PPI) 
network is built by integrating five curated human PPI databases: 
HPRD [39], BioGrid [40], IntAct [41], MINT [42], and 
Reactome [43]. Only those interactions that are found in at least three of 
these databases are considered. The resulting network contained 
7001 nodes and 19,188 edges, where each edge e is assigned a 
weight calculated as follows: 


\[
w(e) = 1 - |\mathrm{cor}(x, y)|
= 1 - \left| \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}} \right| \tag{10}
\]

Here $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$ are expression 
profiles of the two nodes in an edge e, and $\bar{x}$ and $\bar{y}$ are mean values 
of x and y, respectively. This PPI network is further decomposed 
into network modules by applying the Markov Clustering 
algorithm [44], but only those modules that contain at 
least one candidate causal gene are selected. The activity of each network 
module $M_i$ in sample $S_j$ is calculated as follows: 

3 mj~^3nk 


M ij — 


E E — 

0 m <= {CGCnM,}^^^) e E{s m ) 

J E # ( £ U.)) 

V j m <E{CGCnMi} 


( 11 ) 


where $E(g_m)$ is the set of edges belonging to the candidate causal 
gene $g_m$ in module $M_i$, $\#(E(g_m))$ represents the total number of 
edges in $E(g_m)$, and $g_{mj}$ is the normalized gene expression value of 
the gene $g_m$ in sample $S_j$. Next, a classifier is built for selecting the 
causal modules as follows: 


\[
\begin{cases}
\| S - \bar{S}_{control} \|_2 - \| S - \bar{S}_{case} \|_2 < 0, & \text{for } S \in S_{control}, \\
\| S - \bar{S}_{case} \|_2 - \| S - \bar{S}_{control} \|_2 < 0, & \text{for } S \in S_{case},
\end{cases} \tag{12}
\]

where $S$, $S_{control}$, $S_{case}$, $\bar{S}_{control}$, and $\bar{S}_{case}$ are the sample, the set of 
non-tumor samples, the set of tumor samples, the center of the non-tumor 
samples, and the center of the tumor samples, respectively. These 
classifier conditions can be further simplified as follows (for details, 
see supplementary texts of original article): 

\[
C \cdot (x_1, x_2, \ldots, x_k)^T < 0, \tag{13}
\]

where $x_i$ is an indicator variable having value 1 if module $M_i$ is 
selected, and 0 otherwise; and C is a matrix that is defined as a 
function of $M_{ij}$ as follows: 


C := 



M\\ + • • • + M\ n \ 2 
n J 



Mkl + • • • + Mkn 


(14) 


Here, any element $C_{ij}$ of the above matrix C represents the 
contribution of the module $M_j$ to the ith sample condition. The 
objectives of this classifier are twofold: (1) classifying tumor and 
non-tumor samples, and (2) identifying a small number of modules. 
This module identification problem is modeled as a binary integer 
linear programming problem as follows: 

\[
\begin{aligned}
\min_{x_1, x_2, \ldots, x_k} \quad & \sum_{i=1}^{k} x_i + \lambda \sum_{i=1}^{s} \sum_{j=1}^{k} C_{ij}\, x_j \\
\text{s.t.} \quad & C \cdot (x_1, x_2, \ldots, x_k)^T < 0, \\
& \sum_{i=1}^{k} x_i \ge 1, \quad x_i \in \{0, 1\}, \; i \in \{1, 2, \ldots, k\},
\end{aligned} \tag{15}
\]

where s is the number of samples. In this objective function, the first 
term encourages a small number of modules to be found, whereas 
the second term maximizes the classification 
ability of the modules by minimizing $C \cdot (x_1, x_2, \ldots, x_k)^T$. $\lambda$ is the 
controlling parameter that balances the trade-off between these 
two terms. However, this binary integer linear programming model 
for module identification is computationally expensive. Therefore, 
the model is reformulated as a 
simple linear programming model in which the binary variables 
$x_i \in \{0, 1\}$ are relaxed to continuous variables $x_i \in [0, 1]$. For 
further detail, see Note 5. 
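
The edge weight used earlier in this section, w(e) = 1 - |cor(x, y)|, is simple to compute directly from two expression profiles. The sketch below is illustrative only: the function name `edge_weight` is hypothetical, and returning a weight of 1 for a constant profile (where the correlation is undefined) is an assumption made here, not something stated in the source.

```python
import math


def edge_weight(x, y):
    """Return 1 - |Pearson correlation| for two equal-length expression profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        # Assumption: a constant profile carries no correlation signal.
        return 1.0
    return 1.0 - abs(covariance / (spread_x * spread_y))
```

Strongly co-expressed or strongly anti-correlated gene pairs therefore receive weights near 0, and uncorrelated pairs receive weights near 1.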


3.3 Cerami et al. 

The method of Cerami et al. [45] is an integrated approach for 
identifying core pathways altered in glioblastoma. It combines 
sequence mutation, copy number aberration (CNA), and 
protein-protein interaction (PPI) datasets. The first step of this method is to 
construct a global Human Interaction Network (HIN) from 
literature-curated data sources only. To cover more interaction 
information, the HIN is constructed based on the union of (a) interactions 
obtained from the HPRD website (http://www.hhprd.org/) and 
(b) various signaling pathway databases, specifically Reactome, 
NCI/Nature Pathway Interaction DB, and MSKCC Cancer Cell 
Map from Pathway Commons (http://www.pathwaycommons.org). 
Information from the last of these pathway sources was in 
BioPAX format, which is represented as subgraphs of biochemical 
networks. A set of rules was defined to map these subgraphs into 
binary interaction data. After removing all redundancies and 
self-directed interactions, the HIN contained 9264 genes and 68,111 
interactions. 

Sequence mutation and copy number datasets of Glioblastoma 
Multiforme (GBM) for paired samples were collected from the TCGA 
data portal (https://tcga-data.nci.nih.gov/tcga/). Copy number 
aberration data was analyzed using the RAE algorithm [19], which 
discretizes all isoforms of autosomal genes into multiple putative 
aberration states, and finds statistically aberrant regions with 
q-values. Next, the statistical significance of each gene’s aberration 
is defined as the minimum of the q-values of all the regions 
spanning the corresponding gene’s coding locus. A set of altered 
genes is identified, where a gene is defined as altered only if it has a 
validated non-synonymous somatic nucleotide substitution, a 
homozygous deletion, or a multi-copy amplification. 

Next, a GBM-specific network was constructed in which the 
node set is the union of the set of altered genes and a set of linker 
genes. For each gene in the altered gene set, the corresponding 
neighbor genes are identified in the HIN. Neighbor genes having 
degree one are trivially ignored, as they are connected to exactly 
one altered gene. The remaining neighbor genes with degree $\ge 2$ 
have the potential to connect two or more altered genes, and are 
thus considered to be candidate linker genes. Only linker genes that 
are found to be statistically significant by a hypergeometric test 
among all other candidate linker genes are further assessed. The 
null hypothesis is: the linker genes connect the observed number of 
altered genes in HIN only by chance. p-values from the statistical 
assessment of this hypothesis are further corrected using the 
Benjamini-Hochberg procedure [46], giving corresponding q-values, 
and the genes having q-values < 0.05 are selected as a final list of 
linker genes. The final network contained six linker genes 
connecting 66 GBM altered genes, and their corresponding PPI 
interactions in the HIN. 
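
The multiple-testing step can be illustrated with a short sketch of the Benjamini-Hochberg adjustment. The function name `benjamini_hochberg` and any example p-values are hypothetical; only the 0.05 cut-off is taken from the text.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values (q-values) in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


def select_linkers(genes, pvalues, cutoff=0.05):
    """Keep candidate linker genes whose q-value falls below the cut-off."""
    qvalues = benjamini_hochberg(pvalues)
    return [gene for gene, q in zip(genes, qvalues) if q < cutoff]
```

Because the adjustment depends on the number of candidates tested, the set of selected linker genes can change when the candidate list is filtered differently before testing.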

To find network modules in the resulting GBM-specific 
network, the edge-betweenness algorithm was applied. Originally 
proposed by Girvan and Newman [47], this algorithm applies a 
divisive approach where at each iteration an edge with the highest 
edge-betweenness score among all other edges is identified and 
removed from the network in order to reveal modular structure. 
The edge-betweenness score of a particular edge is defined as the 
number of shortest paths between pairs of nodes that traverse that 
edge [47]. More specifically, the shortest paths between all pairs of 
vertices are identified, and then for each edge the number of 
shortest paths that include that edge is counted and considered as the 
edge-betweenness score for that particular edge. After each edge 
removal, the edge-betweenness scores of the edges of the updated 
network are recalculated. (Only those edges which are affected by 
the particular edge removal require recalculation of this score.) To 
obtain a partition yielding the best modular structure, network 
modularity [48] is also calculated after each edge removal. This 
process continues until there are no remaining edges. The 
maximum network modularity score obtained during this process 
indicates the optimal number of edges to be removed. The network 
modularity score is defined as follows: 

\[
M = \sum_{s=1}^{N_M} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right], \tag{16}
\]

where $N_M$ is the number of modules, $l_s$ is the number of edges 
within module s, L is the total number of edges in the network, and 
$d_s$ is the summation of the degrees of all the nodes within module s. 
Modularity quantifies the fraction of network edges connecting the 
nodes within modules minus the expected fraction of network 
edges obtained by forming random connections among the nodes 
within the module, subject to the same modular divisions. A value 
of M close to 0 indicates that the number of within-module edges is 
consistent with random formation, whereas a value close to 1 
indicates stronger modular structure. This procedure results in a 
set of modules extracted from the GBM-specific network. 

3.4 VToD 

VToD [17] integrates gene expression (GE), copy number 
aberration (CNA), and protein-protein interaction (PPI) datasets in 
order to find cancer-related modules in glioblastoma and ovarian 
cancer patients. The GE and CNA data matrices are obtained from 
the TCGA data portal [28]; both are Level 3 datasets. The PPI dataset 
is obtained from Cerami et al. [45]. This method provides an 
integrated framework that infers pair-wise relationships between 
genes based on both data-driven and topological properties (see 
Note 3). A data-driven property of a pair of genes is a correlation 
observed between the data obtained for those genes. These 
correlations may be of three types: GE-GE, GE-CNA, or CNA-CNA 
correlations. Data-driven properties also include the indirect 
relationships discussed below. Topological properties are connections 
observed in PPI networks. 

3.4.1 Constructing a Gene-Gene Relationship Network 

The method starts with a set of seed genes S, thought to be related 
to cancer progression and malignance. This set is a union of a set of 
differentially expressed genes and a set of significantly altered genes. 
Differential expression is identified using a two-tailed pooled 
t-test, and the corresponding p-values are corrected using the 
Bonferroni correction. A set of significantly altered genes is found by 
mapping gene symbols to the collected focal aberrant regions [25, 
26] identified by the GISTIC [18] and RAE [19] algorithms. Next, the 
Gene-Gene Relationship Network (GGR), a weighted undirected 
network, is defined. Nodes of this network represent the seed genes 
and edges represent direct or indirect pair-wise relationships among 
genes. The absolute value of the Pearson correlation coefficient 
(PCC) is used to identify pair-wise relationships between genes, and 
as a weight on each edge. 

For any gene-pair ($gene_i$, $gene_j$), all three types of absolute PCC 
value (GE-GE, GE-CNA, and CNA-CNA) are calculated, 
depending on data availability. The maximum of these is defined as the 
data-driven property of that particular gene-pair and termed its 
r_value. For the gene-pairs ($gene_i$, $gene_i$) this r_value is considered 
to be 0. If an r_value is greater than some threshold, then a direct 
relationship is defined for that particular gene-pair. The gene-pairs 
for which a direct relationship is not found may still be connected if 
an indirect relationship is identified. An indirect relationship 
between two particular genes is a statistically significant simple 
path joining those two genes in the PPI network (see Note 6). To 
identify such statistically significant paths, the observed paths 
between particular gene-pairs are compared with paths in a 
random PPI network, which is generated in such a way that gene 
interactions are randomly assigned while the network topology and 
gene expression values are the same as those in the observed PPI 
network. In other words, the random PPI network has the same 
number of interactions (edges) as the observed one, but the genes 
(nodes) of the observed PPI network are shuffled in the random 
PPI network. The null hypothesis for this statistical significance test 
is: the geometric mean of r_values of the simple path found in the random 
PPI network is greater than or equal to that of the observed path. In order 
to reduce the time complexity, a heuristic search is applied only for 
those gene-pairs for which there is a connection in the PPI network 
(see Note 6). All the simple paths between two genes with a fixed 
path length are identified using a breadth first search (BFS) 
algorithm. Furthermore, only those simple paths are selected in which 
all the constituent genes have either GE, or CNA, or both datasets 
available. Since there can be multiple such paths found, a path P* 
with maximum average PPI connectivity is selected: 

\[
P^{*} = \underset{P}{\arg\max} \; \frac{1}{n} \sum_{i=1}^{n} norm\_deg(gene_i), \tag{17}
\]

where $norm\_deg(gene_i)$ is the degree of connectivity for $gene_i$ 
normalised by the global maximum connectivity in the PPI network, 
and n is the number of genes along the path. The statistical 
significance of the path P* is measured as above, and the path is selected if its 
corresponding p-value is below 0.05. For the gene-pairs for which a 
statistically significant path is found, an edge is added to the GGR 
network, where the edge weight is the average of all the pair-wise 
r_values of gene-pairs along the path P*. 
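
The path search and the selection of P* can be sketched as follows. This is an illustration under assumptions rather than the VToD implementation: the function names are hypothetical, the PPI network is assumed to be a dictionary mapping each gene to the set of its interaction partners, the edge weight is assumed to average the r_values of consecutive gene-pairs along the path, and the permutation-based significance test is omitted.

```python
from collections import deque


def simple_paths(adj, src, dst, length):
    """Breadth-first enumeration of simple paths src -> dst with exactly `length` edges."""
    found = []
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 == length:
            if path[-1] == dst:
                found.append(path)
            continue
        for nxt in adj.get(path[-1], ()):
            if nxt in path:
                continue  # keep the path simple (no repeated genes)
            if nxt == dst and len(path) != length:
                continue  # the target may only appear as the final gene
            queue.append(path + [nxt])
    return found


def best_path(adj, src, dst, length):
    """Pick the path with the highest mean normalized PPI degree (cf. Eq. 17)."""
    max_degree = max(len(partners) for partners in adj.values())

    def mean_norm_deg(path):
        return sum(len(adj[g]) / max_degree for g in path) / len(path)

    candidates = simple_paths(adj, src, dst, length)
    return max(candidates, key=mean_norm_deg) if candidates else None


def path_weight(path, r_value):
    """Average r_value over consecutive gene-pairs (assumed reading of the text)."""
    pairs = list(zip(path, path[1:]))
    return sum(r_value.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)
```

Fixing the path length keeps the enumeration tractable; the number of simple paths grows quickly as the permitted length increases, which is why Note 6 treats this threshold as a crucial parameter.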
3.4.2 Module Detection 


Next, a Voting based module detection algorithm identifies 
overlapping modules in the GGR network by combining Topological 
and Data-driven properties. The name of the method, VToD, is 
an acronym for this procedure. First, a pairwise score (vote) is 
calculated for every pair (g, m) of genes in S using the following equation: 

\[
vote(g, m) = \frac{norm\_deg(m)}{SPL(g, m)} + r\_value(g, m), \tag{18}
\]

where $norm\_deg(m)$ is the degree of connectivity of m 
normalized by the global maximum PPI connectivity, $SPL(g, m)$ is the 
shortest path length between the two genes in the PPI network, 
and $r\_value(g, m)$ is the relationship value calculated for the 
constructed network GGR. This definition states how much vote-score 
a gene m can get from another gene g, for any pair (g, m) of genes in S. 
Note that the vote(g, m) score in the above equation is not a 
symmetrical measure because of the definition of the topological property 
(norm_deg(m) in the above equation). A high score indicates either (1) 
that a gene-pair (g, m) has a high data-driven relationship r_value or (2) 
that a gene g is interacting with a gene m with a high topological value 
in the PPI network. Note that the shortest path length SPL is 
constrained by a user-defined threshold to control the compactness of 
the module. If any of the shortest paths has length above that 
threshold, that path is ignored. 
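
A minimal sketch of the vote score is given below. The function name, the dictionary-based inputs, and the default threshold `max_spl=3` are hypothetical. Two interpretations are assumed here and labeled in the code: a shortest path longer than the threshold contributes nothing to the topological term, and a gene's vote for itself uses its normalized degree only.

```python
def vote(g, m, norm_deg, spl, r_value, max_spl=3):
    """Vote that gene m receives from gene g: norm_deg(m) / SPL(g, m) + r_value(g, m).

    norm_deg: gene -> PPI degree divided by the global maximum degree
    spl:      (g, m) -> shortest path length in the PPI network
    r_value:  frozenset({g, m}) -> relationship value in the GGR network
    """
    if g == m:
        # Assumption: no self-loops, so only the topological weight of g remains.
        return norm_deg[m]
    distance = spl.get((g, m))
    if distance is None or distance > max_spl:
        topological = 0.0  # assumed handling of an ignored (too long) path
    else:
        topological = norm_deg[m] / distance
    return topological + r_value.get(frozenset((g, m)), 0.0)
```

Because the topological term depends only on the degree of the receiving gene m, vote(g, m) and vote(m, g) generally differ, which is what lets well-connected hub genes accumulate votes from many neighbors.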

Next, for any gene $g \in S$, the corresponding vote-scores with all 
the genes $m \in S$ (including g) are stored in a table. Here, vote(g, g) 
is defined with norm_deg(g) only, since r_value(g, g) = 0, and 
SPL(g, g) is not defined for the PPI network as it doesn’t contain 
any self-loop. Next, the table for the gene g containing vote-scores 
of all the genes $m \in S$ is sorted in descending order of vote-score. 
In that sorted table, the ranking of each gene m is defined as its local 
rank. Then, in that sorted table, the cumulated vote-score of the 
top-ranked m ($\in S$) genes is calculated. If the 
cumulated vote-score of the top-ranked m gene(s) is within 
the top k% (a user-defined threshold) of the total cumulative vote-score 
in that particular table (for gene g), then those top-ranked m 
gene(s) are considered as candidate representative gene(s) of that 
particular gene g. Next, if the vote(g, m) score(s) of these 
top-ranked m gene(s) are within the top vote_th% (a user-defined threshold) 
of the distribution of all pair-wise vote-scores (considered as the 
global rank of the gene m), then these m gene(s) are finally 
selected as representative gene(s) of the particular gene g. Thus, 
this technique makes it possible to find overlapping modules in the 
network by allowing multiple representative m genes to be selected 
for a particular gene g. More importantly, this method can select a 
gene $m \in S$ (i.e., a hub-gene in the PPI network) as a representative 
gene for multiple $g \in S$ genes, thus revealing a modular structure. 

Next, these modular structures, called “pre-modules,” are formed, 
each with a representative gene m in the center and aggregating all 
the genes g that chose m. A pre-module is defined as the initial state 
of a module before merging it with other pre-modules to get the 
final module. After removing redundancies and small pre-modules 
(typically with < 3 genes), a module merging algorithm is 
applied. Two pre-modules merge if their pair-wise members are 
closely connected in the PPI network (topological property) or 
highly related in GGR (data-driven property). For this purpose, a 
pair-wise merging value $MV(C_i, C_j)$ between any two pre-modules 
$C_i$ and $C_j$ is calculated as follows: 

\[
MV(C_i, C_j) = IC(C_i, C_j) + \frac{1}{n_i\, n_j} \sum_{g_k \in C_i} \sum_{g_l \in C_j} r\_value(g_k, g_l), \tag{19}
\]

where $n_i$ and $n_j$ are the sizes of the two pre-modules $C_i$ and $C_j$, 
respectively, and $n_i \le n_j$ (that is, $C_i$ is assumed to be no larger 
than $C_j$). Inter-connectivity $IC(C_i, C_j)$ is a kind of topological 
property relating $C_i$ to $C_j$: it is the proportion of genes in $C_i$ having 
at least one PPI partner in $C_j$. The second term in the above 
equation denotes the data-driven property for the pair $C_i$ and $C_j$: 
it is the average of the gene-gene relationship values over all pairs of 
a gene in $C_i$ with a gene in $C_j$. At each iteration of the module 
merging procedure, the two pre-modules with the highest pair-wise 
merging value (calculated using the above equation) are merged 
together and replaced by the newly merged module. This merging 
process continues until the highest pair-wise merging value at some 
iteration becomes less than some threshold merging_th (for the 
details of this threshold selection, see supplementary method of 
original article). 
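
The merging value can be sketched directly from the verbal definitions of its two terms: the proportion of genes in one pre-module with a PPI partner in the other, plus the average relationship value over all cross pairs. The function name and the data layout (sets of genes, a dictionary of PPI partners, and r_values keyed by unordered pairs) are assumptions made for this illustration.

```python
def merging_value(ci, cj, ppi, r_value):
    """Merging value of two pre-modules: inter-connectivity plus mean cross r_value.

    ci, cj:  sets of genes (ci is taken to be the pre-module whose genes are counted)
    ppi:     gene -> set of PPI partners
    r_value: frozenset({a, b}) -> relationship value in the GGR network
    """
    # Topological term: share of genes in ci with at least one PPI partner in cj.
    linked = sum(1 for gene in ci if ppi.get(gene, set()) & cj)
    inter_connectivity = linked / len(ci)
    # Data-driven term: average r_value over every (gene in ci, gene in cj) pair.
    # A shared gene paired with itself contributes 0, matching r_value(g, g) = 0.
    total = sum(r_value.get(frozenset((a, b)), 0.0) for a in ci for b in cj)
    return inter_connectivity + total / (len(ci) * len(cj))
```

In a merging loop, this value would be computed for every pair of current pre-modules, the best-scoring pair merged, and the process repeated until the best score drops below the chosen threshold.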


4 Validating Cancer Sub-networks 

There are several ways to validate cancer modules identified by the 
above procedures. Most of them involve statistical hypothesis 
testing and are specific to the methodology used to identify modules. 
However, there are a few general techniques that can be used to 
validate modules, as follows: 

4.1 Topological Validation 

Ideally, a modular network is expected to have dense intra-module 
connections but sparse inter-module connections. Therefore, 
proposed networks can be assessed for both high density of connections 
within modules and high separability of component modules. 
Equation 16 states the modularity measurement [48], which 
compares the connection-density of a particular module with that of 
a module formed by making random connections among its 
constituent genes. Similarly, the following equation quantifies the 
separability of modules [48]. 

seperationScore = 

N m 

E 

( 20 ) 

where $N_M$ is the number of modules, $l_s$ is the number of edges 
within module s, and $d_s$ is the summation of the degrees of all the 
nodes within module s. Both “Modularity” and “Separability” 
scores can be calculated using the above equations (Eqs. 16 and 20, 
respectively), where a higher “Modularity” value indicates stronger 
modular structure, and a higher “Separability” score indicates that a 
particular module is more easily separable from the original network 
topology by deleting some edges. 

4.2 Enrichment Analysis 

f-Measure. Modules can also be validated using a quantity known as 
an f-measure [48]. This quantity evaluates the accuracy of identified 
modules by comparing them with known reference modules such 
as GO functional categories, known biological pathways, and 
others. The f-measure can be calculated using the following equation: 

\[
f\text{-measure} = \frac{2 \times Precision \times Recall}{Precision + Recall}, \tag{21}
\]

where $Precision = |M \cap F_i| / |M|$ and $Recall = |M \cap F_i| / |F_i|$ are the 
positive predictive value and true positive rate, respectively. Here, M is a 
particular module and $F_i$ is a known functional module. For example, if a 
module M (typically, a set of genes) is mapped to a known 
functional category $F_i$: “Cell Cycle,” then the Precision and the Recall 
are the fractions of genes common to both M and $F_i$ relative to the size of 
M and to the size of $F_i$, respectively. Bigger modules will have 
higher Recall values, whereas smaller modules will have higher 
Precision values. Therefore, the accuracy of any identified module 
M can be measured by calculating the harmonic mean of these two 
values as the f-measure. 
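
As a small worked illustration of the f-measure, the sketch below compares a module with a reference gene set using Python sets. The function name and the convention of returning zeros when the two sets do not overlap are assumptions made here.

```python
def f_measure(module, reference):
    """Return (precision, recall, f-measure) of a module against a reference gene set."""
    common = len(module & reference)
    if common == 0:
        return 0.0, 0.0, 0.0  # assumed convention when nothing overlaps
    precision = common / len(module)    # share of the module found in the reference
    recall = common / len(reference)    # share of the reference recovered by the module
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For a module of four genes of which two belong to a reference category of eight genes, precision is 0.5 and recall is 0.25, so the harmonic mean sits closer to the smaller of the two values.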

Hypergeometric Analysis. A hypergeometric test can also be used 
to assess modules statistically [48]. p-values can be calculated using 
the hypergeometric distribution to indicate the significance of 
correspondence between a module and a known functional category: 

\[
p\text{-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{|X|}{i} \binom{|V| - |X|}{n - i}}{\binom{|V|}{n}}, \tag{22}
\]

where |V| is the total number of genes (i.e., all the genes in the human 
genome), |X| is the number of genes in a known functional category 
(such as a GO term or known pathway), n is the number of genes in 
an identified module, and k is the number of genes in the 
intersection of that particular module with the known functional category. 
Here, a low p-value indicates that the identified module is 
significantly enriched in known functions or pathways. For example, the 
dhyper function in the built-in R package stats can be used 
to calculate p-values of the hypergeometric test [49]. 
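
The same tail probability can be computed with exact integer arithmetic, which may help when checking results from a statistics package. The sketch below is illustrative: the function and argument names are hypothetical, and it sums the upper tail directly, which is algebraically the same as one minus the sum over i = 0, ..., k - 1.

```python
from math import comb


def hypergeom_pvalue(total_genes, category_size, module_size, overlap):
    """P(at least `overlap` category genes in a module drawn without replacement)."""
    denominator = comb(total_genes, module_size)
    upper = min(module_size, category_size)
    # Sum the probability mass for i = overlap, ..., upper using exact integers.
    tail = sum(
        comb(category_size, i) * comb(total_genes - category_size, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / denominator
```

Summing the upper tail with integers avoids the loss of precision that can occur when a value very close to 1 is subtracted from 1, which matters for the very small p-values typical of strong enrichment.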


5 Notes 


1. In general, most of the integrative approaches that aim to find 
cancer-related modules are based on a common hypothesis: 
tumors are characterized by aberrations in specific biological 
modules that are critical in terms of cancer initiation and 
malignance. There are two major steps in such methods: (1) building 
the network model, and (2) identifying modules 
(sub-networks). In defining gene dependencies in network models, 
some methods rely on PPI information only [15, 45], some on 
data-driven information only [32, 50-52], and some on both of 
those properties [13, 17]. 

2. In any integrated approach, higher statistical significance is 
achieved by using paired sample data rather than unpaired 
data. Moreover, pair-wise relationships between genes obtained 
by integrative approaches applied to unpaired sample data may 
produce false positive results [24]. Here, paired data means 
that various heterogeneous data types (e.g., GE, CNA, 
methylation, miRNA) are measured on the same samples. However, 
appropriate data normalization and standardization techniques 
are crucial to obtain correct inferences using paired data. 

3. Integrating as many heterogeneous datasets as possible can 
improve characterizations of driver genes and cancer modules. 
Zhang et al. found that the integration of three heterogeneous 
datasets (GE + CNA + mutation) provides additional useful 
information and can produce statistically significant core 
modules in both glioblastoma and ovarian cancer compared to the 
integration of two heterogeneous datasets (GE + mutation, or 
CNA + mutation) [32]. Similarly, Azad et al. showed that 
modules found by combining topological and data-driven properties 
(PPI + GE + CNA) of gene-pairs result in better functional 
enrichment than those found by using only topological (PPI), 
or only data-driven (GE + CNA) properties [17]. Akavia et al. 
reported that combining CNA and GE provides greater 
sensitivity for identifying RAB27A as a novel driver gene in a 
melanoma dataset. They also showed that this gene would not 
be selected based on CNA alone [21]. 

4. The parameters of the iMCMC method for integrating somatic 
mutation, CNA, and GE datasets are set in such a way that the 
method can balance the influence of different data sources on 
the network, and on the vertex and edge weights [32]. 

5. The problem of module identification in Wen et al. is formulated 
as a binary Integer Linear Programming (ILP) problem, which is 
NP-hard. To resolve this issue, the binary variables $x_i \in \{0, 1\}$ 
are relaxed to continuous variables $x_i \in [0, 1]$. The problem is 
then solved using a simple linear programming algorithm. To 
choose the penalty parameter $\lambda$, the classification ability of the 
identified modules is defined as follows: 

\[
CP = \max \left( C \cdot (x_1, x_2, \ldots, x_k)^T \right), \tag{23}
\]

where the term on the right-hand side is the maximum element 
of the vector. The ILP is then solved for each value of $\lambda$ between 
0 and 1, in increments of 0.01, and the value of $\lambda$ that produces 
the smallest value of CP is selected. The justification for this is 
the observation that smaller values of the elements of 
$C \cdot (x_1, x_2, \ldots, x_k)^T$ indicate a greater ability to distinguish 
between cancer and normal samples. 

6. VToD combines GE, CNA, and PPI information among gene 
pairs to find cancer-related modules. In searching for indirect 
relationships among gene-pairs, VToD considers the 
sub-network (with the genes for which pair-wise direct relationships 
are not defined) as fully connected. Therefore, finding a 
statistically significant indirect relationship through a set of 
intermediate genes is an NP-hard problem. This problem is solved 
heuristically by restricting pair-wise adjacency among gene-pairs 
to PPI information only, and converting the problem 
into finding a statistically significant simple path between 
gene-pairs. However, a threshold for the length of a simple path is a 
crucial parameter for handling time-complexity in this regard. 
Chapter 8 


Metabolic Pathway Mining 

Jan M. Czarnecki and Adrian J. Shepherd 

Abstract 

Understanding metabolic pathways is one of the most important fields in bioscience in the post-genomic 
era, but curating metabolic pathways requires considerable man-power. As such there is a lack of reliable, 
experimentally verified metabolic pathways in databases and databases are forced to predict all but the most 
immediately useful pathways. 

Text-mining has the potential to solve this problem, but while sophisticated text-mining methods have 
been developed to assist the curation of many types of biomedical networks, such as protein-protein 
interaction networks, the mining of metabolic pathways from the literature has been largely neglected by 
the text-mining community. In this chapter we describe a pipeline for the extraction of metabolic pathways 
built on freely available open-source components and a heuristic metabolic reaction extraction algorithm. 

Key words Metabolic pathway, Metabolic interaction extraction, Text-mining, Natural language 
processing, Named entity recognition, Information extraction 


Abbreviations 

NER Named entity recognition 
NLP Natural language processing 
PPI Protein-protein interaction 


1 Introduction 


PubMed currently (as of February 2016) contains over 25 million 
article records, and this number is increasing at a faster rate than 
ever [1]. In some fields, researchers are encouraged, or even 
required, to submit results to databases in a standard format. For 
instance, upon solving the structure of a protein, an X-ray 
crystallographer will submit the structure to the Protein Data Bank (PDB) 
as well as submitting the results in a paper for peer review [2]. This 
allows anybody with an Internet connection to find curated 
structures of proteins of interest quickly and efficiently. The data, 
being in a standard format, is also easily consumed by computer 
programs, allowing large-scale studies involving many structures to 
be carried out. For instance, the CATH project has developed a 
semi-automated system for classifying protein domain structures 
through the comparison of structures in the PDB [3]. 

Unfortunately, this method of submitting results in a standard, 
computer-readable language is found in few areas of bioscience. 
Currently, the vast majority of data in many biomedical subdomains 
is only available as unstructured text spread across many publishers’ 
websites. 

The study of metabolic pathways is one such field that suffers 
from a lack of manually curated data in databases. BRENDA is a 
large database of curated metabolic reactions, but individual 
reactions are not linked together to form pathways (meaning that there 
is little motivation to curate complete pathways from single 
organisms) [4]. KEGG [5] and BioCyc [6] are two databases that were 
developed to curate metabolic pathways. Ultimately, however, the 
databases are populated by human curators, which means it is 
practically impossible to keep up with all new articles being 
published. Training of a FlyBase Genetic Literature Curator, for 
instance, can take 6 months in addition to the time taken to actually 
perform a curation [7]. 

Computer processors are designed to follow very strict 
commands. This is reflected in the languages that are used to command 
computers, which, while incredibly varied, ultimately come down 
to providing a list of instructions for the various pieces of hardware 
within the computer. Therefore, the data which a computer 
program is designed to read and process must be stored in a strict 
format which the program can follow strict instructions to parse. 
Computer hardware, programs, and data storage formats are all 
designed from the bottom up with this philosophy in mind. Natural 
language, however, isn’t designed, but is constantly evolving, and its 
rules can often be hard to define. The English language is 
particularly notorious for having significant exceptions to the majority of 
spelling and grammatical rules. 

Text-mining development in bioscience initially focused on 
systems for named entity recognition (NER), the process of 
classifying elements in text into predefined categories. Current 
state-of-the-art NER tools (focusing on entities such as proteins, small 
molecules, drugs, and organisms) are able to achieve very high 
levels of accuracy, typically with F-scores greater than 90% (see 
Note 1). Focus has, therefore, shifted to interaction extraction, the 
process of determining the nature of relationships between 
different named entities. Interaction extraction can be used to determine 
abstract relationships, such as gene-disease relationships, or more 
direct, physical relationships, such as protein-protein interactions 
(PPIs). As metabolic reactions fall into the latter category and the 
extraction of PPIs is the topic upon which most research has 
focused, a review of PPI extraction methods provides a useful 
backdrop for the development of an extraction method for 
metabolic reactions. 


2 Existing Protein-Protein Interaction Extraction Methods 

A range of experimental methods has been developed 
to characterize PPIs, from narrowly focused methods 
such as X-ray crystallography, which offers the most convincing 
evidence that two proteins form a stable complex, to broad-scoped 
methods such as yeast two-hybrid screens, which can find potential 
binding partners from a large pool of proteins. The IntAct database 
[8] (which contains curated PPIs from the 14 members of the 
IMEx Consortium [9]) contains interactions extracted from more 
than 14,000 publications (as of April 2016). While this is a 
monumental manual effort, it is still only a small fraction of the available 
material. 

PPI extraction was the subject of one task at BioCreative II 
[10] in 2006, where teams were tasked with extracting PPIs from 
documents curated by IntAct and MINT (which were separate 
databases at the time, before merging in the IMEx Consortium). 
Extracted interactions could then be compared to the gold 
standard, manually curated interactions. The best performing tool 
achieved an F-score of 29%, far lower than the high performance 
achieved by NER tools. Two general approaches to the problem 
were identified in the subsequent analysis of the submitted tools; 
these have been termed local association analysis and global 
association analysis. Local association analysis identifies 
co-occurring proteins at either the sentence or passage level and may 
use other approaches such as interaction word lists and/or machine 
learning techniques to determine if an interaction between the 
co-occurring pair is described. Global association analysis focuses less 
on the characteristics of individual sentences, but rather looks at the 
co-occurrence of protein names multiple times in a document or 
over the whole collection. Global association analysis is more 
suitable for extracting well-known interactions that are described 
frequently in the literature, but only local association analysis is able to 
determine novel interactions that have only been described once. 
The method described here incorporates both local and global 
association analysis. 

Kabiljo et al. [11] carried out a comparison of a range of PPI
extraction tools including AkanePPI [12], OpenDMAP [13], and
Whatizit [14]. AkanePPI is a state-of-the-art tool that utilizes many
natural language processing (NLP) methods. OpenDMAP is a
general purpose information extraction platform which uses a
heuristic approach. The patterns for PPI recognition were created
manually to adapt the tool to the task. Whatizit is a suite of tools
that can perform many bioscientific NLP tasks. The PPI extraction
tool in Whatizit, Protein Corral, uses three methods which utilize
co-occurrence and heuristic techniques.

A simple baseline method was also developed for the comparison.
The method was co-occurrence based, looking for two protein or gene
names within the same sentence as well as an “interaction” verb, such
as binds or phosphorylates (a manually curated list of “interaction”
verbs was used), in between the two entities. This is a similar
methodology to the Co3 method of Protein Corral. The tools were
evaluated on five PPI corpora. While performance across the five
corpora by each tool was variable, the simple baseline method showed
an overall performance that was comparable to the more sophisticated
methods, while being far simpler. We followed this simple methodology
in the development of a metabolic pathway extraction method.
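
The co-occurrence baseline described above can be illustrated with a minimal sketch. It is not the implementation evaluated by Kabiljo et al.: the verb list is a hypothetical, truncated stand-in for the manually curated list, protein names are assumed to be single tokens already recognized by an NER step, and tokenization is a simple regular expression.

```python
import re

# Hypothetical, truncated verb list; the published baseline used a manually
# curated list that is not reproduced here.
INTERACTION_VERBS = {"binds", "phosphorylates", "activates", "inhibits"}


def cooccurrence_pairs(sentence, protein_names):
    """Return (protein_a, verb, protein_b) triples found in one sentence.

    A triple is reported when two different recognized names occur in the
    sentence and an interaction verb appears between them.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    # Positions of tokens that the (assumed) NER step marked as proteins.
    hits = [(i, t) for i, t in enumerate(tokens) if t in protein_names]
    triples = []
    for n, (i, first) in enumerate(hits):
        for j, second in hits[n + 1:]:
            if first == second:
                continue
            between = tokens[i + 1:j]
            verbs = [t for t in between if t.lower() in INTERACTION_VERBS]
            if verbs:
                triples.append((first, verbs[0], second))
    return triples


if __name__ == "__main__":
    names = {"ProtA", "ProtB"}
    print(cooccurrence_pairs("ProtA binds ProtB.", names))
    print(cooccurrence_pairs("ProtA and ProtB were purified.", names))
```

The second sentence yields no triple because no interaction verb lies between the two names, which is the only filter this kind of baseline applies.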

BioCreative III [15] proposed a slightly different PPI extraction
task to that in BioCreative II. The task required the development
of a tool capable of classifying and ranking abstracts according
to their suitability for manual curation of PPIs in the full text. This
behavior is required by PPI databases, such as IntAct, to effectively
manage their curator man-hours and to prevent the needless
curation of irrelevant articles. Semi-automated selection of articles
for manual inspection is common across the majority of biological
annotation databases, but is typically carried out using simple
PubMed searches. While effective at selecting articles relevant to a
particular entity, this method is inadequate when dealing with
complex events and interactions involving multiple entities [16].

Jamieson et al. used text-mining to recreate the HIV-1, Human
Protein Interaction Database [17]. Protein NER was carried out by
BANNER [18], while interactions in text were identified using two
tools, the Turku event extraction system [19] and EventMiner [20].
The NER and event extraction were applied to 3090 titles and
abstracts and 49 full-text articles, achieving a precision, recall,
and F-score of 87.5 %, 90.0 %, and 88.6 %, respectively. The pipeline
was able to completely replicate over 50 % of the database. The team
observed that the greatest obstacles to the automated extraction
were grammatically complex sentences and sentences containing
poor grammar.

While new methods for extracting PPIs are regularly released 
[21, 22], attention is increasingly shifting towards more complex 
relationships, with a particular focus on biomolecular networks and 
pathways [23] such as PPI networks [24, 25], signal-transduction 
pathways [26-28], protein metabolism (synthesis, modification, 
and degradation) [23], and regulatory networks [29, 30]. This 
protein/gene-centric focus has been enshrined in most of the
competitive text-mining events (such as BioCreative [31-33] and
BioNLP [23]). Indeed, in spite of this new focus on networks and
pathways, one of the most important sub-topics, the construction
and curation of metabolic pathways, has largely been ignored.


3 Metabolic Interaction Extraction 

Humphreys et al. [34] developed the template-based EMPathIE
system, which aimed to extract metabolic reactions with contextual
information such as the source organism and pathway name. The
system achieved 23 % recall and 43 % precision on a small corpus of
seven journal articles. EMPathIE is no longer under active
development (R. Gaizauskas, personal communication) and no tool has
since been released to attempt to solve the problem. There are
certain generic systems, such as the GeneWays system for
“extracting, analyzing, visualizing and integrating molecular pathway
data” [26] and the MedScan sentence parsing system [35], that could
potentially be applied to metabolic pathway extraction, but neither
is freely available.

There appears to be a perception that metabolic pathway
extraction is a significantly more difficult problem than
gene/protein interaction extraction. In evaluating the performance
of GeneWays, the developers chose the extraction of
signal-transduction pathways instead of metabolic pathways, suggesting
that the former problem was an “easier target.” Metabolic pathway
extraction has a number of specific challenges compared to PPI
extraction:

• Multiple entity types: Metabolic reactions consist of both
proteins and small molecules, while PPIs consist of a single entity
type (for the purposes of NER, genes and proteins are
indistinguishable).

• Entity mismatch: There is a mismatch between the types of
entities involved in a metabolic reaction (enzymes and
metabolites) and those extracted by popular NER tools
(genes/proteins and small molecules).

• Ternary and n-ary interactions: While PPIs are typically
considered binary (e.g., “protein A binds to protein B”), metabolic
reactions can consist of many entities including an enzyme,
multiple substrates, and multiple products. Furthermore, the
enzyme is optional and may or may not be present in a description
of a reaction, and the number of metabolites can vary
significantly. With a greater number of entities, the likelihood
of information being split over multiple sentences increases.

In developing the method we describe here, we focused on the 
goal of providing assistance to database curators and model builders 
as opposed to the fully automated curation of the literature. Such 
systems are commonplace in large scale curation efforts, such as
FlyBase [36] and the Comparative Toxicogenomics Database [37].
In this context, high recall is typically deemed to be of paramount 
importance, although excessive numbers of false positives detract 
from the usability of such systems [38]. 


4 Components for Developers 

Although the systems described so far cover a wide variety of
general purpose and specialist requirements, the unique
characteristics of many bioinformatics projects necessitate the
development of bespoke solutions. Also, it is frequently the case
that algorithms tuned for one specific domain will outperform their
general purpose rivals. Fortunately, there are various standalone
programs and libraries, available for free, that encapsulate
functional building blocks of text-mining systems, some of which
have been developed from the ground up for bioinformatics
applications or retrained on biological data. Since there are
hundreds of NLP components and libraries available from the broader
computational linguistics community, this section will give
preference to those that are of particular utility or interest to
bioinformatics developers. All the tools we discuss are available as
libraries for Java, the most popular language in the biomedical
text-mining community.

4.1 General Text-Mining Tools

Libraries written in most programming languages exist for carrying
out basic NLP tasks such as sentence parsing and part-of-speech
tagging. One toolkit was found to fit our criteria: Apache
OpenNLP [39].

OpenNLP is a machine learning-based Java library and, as such,
requires extensive training to create a suitable probabilistic model.
While a large number of models are provided alongside the library,
they are all the result of training the library on general-use language
(typically from newspaper articles). The language used in biomedical
articles, however, is highly specialized [40]. Buyko et al. [41]
showed that transferring OpenNLP components to the biomedical
domain was as simple as retraining the tool using a biomedical
corpus, and that a specially designed tool was not necessary, at
least for the low level text-mining tasks that OpenNLP deals with.
OpenNLP was retrained separately on two corpora, GENIA [42] and
PennBioIE [43], and the subsequent performance of five OpenNLP
components was assessed. Each component performed well when trained
with either corpus, with the sentence splitter, tokenizer, and
part-of-speech tagger achieving accuracies of approximately 99 % and
the chunker and parser achieving average F-scores of 92 % and 86 %,
respectively. The
group have released the trained models [44], allowing OpenNLP
to be implemented with no need of further training.

OpenNLP was incorporated into the previously mentioned Mayo
clinical Text Analysis and Knowledge Extraction System, where a
number of components were built on OpenNLP components trained on
clinical data [45].

4.2 Named Entity Recognition

NER, typically the first step taken in a text-mining operation, aims
to find entities within text and assign each to a predefined category.
Solutions have been developed for the recognition of a wide range
of biological entities, but here we focus on those relevant to
metabolic networks: genes/proteins, small molecules, and organism
names.

4.2.1 Gene/Protein NER

The recognition of gene and protein names is one of the best
studied fields in biomedical text-mining, with the task being the
focus of many early competitions, such as BioNLP and BioCreative.
Until 2008 the best performing, freely available tool was ABNER,
which was able to achieve an F-score of 83.7 % on the BioCreative I
test corpus.

Leaman and Gonzalez, recognizing a lack of freely available
tools, developed BANNER, an open-source gene NER tool based
on conditional random fields [18]. BANNER achieved an F-score
of 82.0 % (coming between the 9th and 10th ranked entries of
BioCreative II), while ABNER achieved an F-score of 78.3 % on
the same test corpus. This result was reproduced by Kabiljo
et al. [11], who found that BANNER outperformed ABNER on
four different corpora.

4.2.2 Small Molecule NER

The recognition of small molecules has been the focus of
significantly less research. In 2006, Corbett and Murray-Rust released
the first freely available chemical NER tool, OSCAR3 [46], followed by
OSCAR4 in 2011 [47]. OSCAR, with its ability to recognize both
vernacular and systematic names, is widely used. The first tool to
compete against OSCAR4, ChemSpot, was released in 2012 [48].
The only comprehensive comparison of these tools is found in the
ChemSpot paper, where both tools were tested against the SCAI
chemical corpus [49]. OSCAR4 achieved an F-score of 57.3 %
while ChemSpot achieved 68.1 %. The method described here
utilizes OSCAR4, but ChemSpot would certainly be worth
investigating in its place.

4.2.3 Organism Name NER

Recognizing mentions of organism names is necessary to determine
the context of any entities or interactions found in an article.
LINNAEUS is principally a dictionary-based method which
implements some heuristic rules [50]. A dictionary of organism name
synonyms was created using the NCBI Taxonomy database and
abbreviated names were generated for each entry. Additional
synonyms were identified that occur frequently in the literature, such
as patient referring to Homo sapiens.

The tool performed well on a manually annotated corpus of
100 full-text articles from the PMC Open-Access Subset, with
94.3 % recall and 97.1 % precision. The BioCreative III gene
normalization task required the entries to determine the source
organism of genes in order to link them to database entries [33].
LINNAEUS was the only publicly available organism NER tool at
the time and was used by the vast majority of teams. While the
teams did note some ambiguity in species names and taxonomy
IDs, the performance of LINNAEUS was well regarded.


5 Case Study: A Metabolic Pathway Extraction Tool 

Here we present a heuristic metabolic pathway extraction method. 
The method can be split into four principal subtasks: 

1. The retrieval of relevant documents to mine. 

2. The extraction of individual metabolic pathways using a
heuristic text-mining algorithm.

3. The merging and linking together of individual reaction 
extractions. 

4. The determination of reactions relevant to the user. 

5.1 Document Retrieval

The challenge of retrieving full-text articles has long held back
biomedical text-mining. All article titles and abstracts can be
obtained using the mature and stable E-Utils API provided by the
NCBI [1]. As the API allows article records, containing the article
abstract, to be retrieved in bulk and in a common format, early
text-mining work in the biomedical community concentrated on the
mining of these easily obtainable abstracts. While mining abstracts
can return important data (as the significant findings of a paper will
be repeated in the abstract), a great deal of potentially useful data
is only found in the full article. There has been a clear move towards
developing tools using full-text articles, with the BioCreative III
tasks using corpora of full-text articles for the competition [33].

Unfortunately, publishers have been reluctant to allow their
publications to be mined. While it is possible to retrieve content
programmatically from the publishers’ websites [51] (known as
“screen scraping”), the Robots Exclusion Standard of most
publishers’ websites typically disallows access to screen scraping
tools (with the exception of search engine spiders, such as the
Googlebot). While the rules set by the robots.txt file are purely
advisory and rely on the cooperation of the spider, web
administrators can block access if they wish (see Note 2).

Here we will discuss the use of the NCBI E-Utils API to
retrieve relevant abstracts.

1. The user specifies a pathway (or multiple pathways) of interest,
by MetaCyc ID(s), and an organism of interest, by NCBI
Taxonomy ID.

2. A series of PubMed queries (one for each metabolite found in
the “seed” pathway(s)) are constructed from this user-supplied
information. For instance, suppose the user specifies MetaCyc
pathway “glycolysis I (from glucose-6P)” and the organism
Mycobacterium tuberculosis. The following query would be
constructed for the metabolite pyruvate (truncated lists of
synonyms are used for presentation purposes):

(("M.tuberculosis"[All Fields]) OR
("Mycobacterium tuberculosis"[All Fields]) OR
("Bacterium tuberculosis"[All Fields])) AND
(("pyruvate"[All Fields]) OR
("pyruvic acid"[All Fields]) OR
("alpha-ketopropionic acid"[All Fields]))

Organism synonyms are retrieved using the dictionary
provided by the LINNAEUS organism NER library (which is
based on the NCBI Taxonomy database) [50]. Pathway data is
retrieved using the BioCyc API, while metabolite synonyms are
retrieved from a local ChEBI database [52]. It is not advisable
to include all metabolites in the query: currency molecules,
such as ATP and ADP, should be excluded as they are not
meaningful in the identification of pathways and may lead to
the retrieval of many irrelevant articles.

3. Use the NCBI E-Utils API to find articles matching each
query. The API uses a REST interface and PubMed can be
queried using the following general URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

The following general parameter string can be supplied as
either GET or POST parameters (see Note 3), where {term}
is the search query and {retmax} is the maximum number of
articles to retrieve:

db=pubmed&term={term}&retmax={retmax}

The request will return an XML document containing a list of
PubMed IDs corresponding to articles matching the query (see
Note 4).

4. The metadata for each article can be retrieved in bulk using the
following URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

The following general parameter string can be supplied as
either GET or POST parameters, where each {id} is a PubMed
ID retrieved in the previous step:

db=pubmed&retmode=xml&id={id},{id}

The request will return an XML document containing
metadata (such as the title, authors, and abstract) for each PubMed
ID supplied (see Note 5).
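
As an illustration of the retrieval steps above, the following sketch builds the PubMed query string and the two request URLs without sending any request. The function names and the currency list are hypothetical, the synonym lists are assumed to have been looked up already, and the base URL is copied from the text and should be checked against current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Base URL as given in the text (an assumption that it is still valid).
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hypothetical, truncated list of currency molecules to leave out of queries.
CURRENCY_MOLECULES = {"ATP", "ADP"}


def or_group(synonyms):
    """Join synonyms into a parenthesized OR group of [All Fields] terms."""
    return "(" + " OR ".join(f'("{s}"[All Fields])' for s in synonyms) + ")"


def build_query(organism_synonyms, metabolite_synonyms):
    """Combine organism and metabolite synonym groups with AND."""
    return or_group(organism_synonyms) + " AND " + or_group(metabolite_synonyms)


def pathway_queries(organism_synonyms, synonyms_by_metabolite):
    """Build one query per seed-pathway metabolite, skipping currency molecules."""
    return {
        name: build_query(organism_synonyms, synonyms)
        for name, synonyms in synonyms_by_metabolite.items()
        if name not in CURRENCY_MOLECULES
    }


def esearch_url(term, retmax=100):
    """URL that would return the PubMed IDs matching a query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pubmed_ids):
    """URL that would return XML metadata for a batch of PubMed IDs."""
    params = {"db": "pubmed", "retmode": "xml", "id": ",".join(pubmed_ids)}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    organism = ["M.tuberculosis", "Mycobacterium tuberculosis"]
    metabolites = {"pyruvate": ["pyruvate", "pyruvic acid"], "ATP": ["ATP"]}
    for name, query in pathway_queries(organism, metabolites).items():
        print(name, esearch_url(query, retmax=20))
```

Note that `urlencode` percent-encodes the query (including the comma between IDs), which is equivalent to the literal parameter strings shown above when sent as GET parameters.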

5.2 A Heuristic Metabolic Reaction Extraction Method

Czarnecki et al. describe a pattern-based method for extracting
individual metabolic reactions from text [53], which was
implemented in this pipeline. Proteins and small molecules are
recognized using the NER tools BANNER and OSCAR4, respectively.
Patterns (such as substrate-product or enzyme-product-substrate)
are assigned to individual sentences and each pattern assignment is
scored according to how well it matches the entities extracted by
BANNER and OSCAR4 and the words occurring between the entities. For
instance, if two small molecules are assigned as substrate and
product, the substrate would be expected to be preceded by a reaction
word, while a production word would be expected to occur
between the two entities (see Note 6). Other words, such as variants
of the verb catalyze, prepositions (e.g., to, from, and by), and the
coordinating conjunction and, are also accounted for. A training set
was used to calculate appropriate scores for these individual factors,
which are added together to create a final assignment score. The
highest scoring assignment, if greater than a given threshold, is
returned to the user.
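
To make the additive scoring idea concrete, here is a deliberately small sketch for the substrate-product pattern only. It is not the published method: the word lists, weights, and threshold are hypothetical placeholders (the real scores were derived from a training set), "preceded by" is interpreted loosely as "anywhere earlier in the sentence", and the small molecule positions are assumed to come from an NER step.

```python
# Hypothetical word lists and weights, used only for illustration.
REACTION_WORDS = {"conversion", "oxidation", "hydrolysis", "reduction"}
PRODUCTION_WORDS = {"produce", "produces", "yield", "yields", "form", "forms"}
WEIGHTS = {
    "reaction_word_before_substrate": 1.0,
    "production_word_between": 1.0,
    "preposition_to_between": 0.5,
}
THRESHOLD = 1.0


def score_substrate_product(tokens, substrate_idx, product_idx):
    """Sum the weights of the factors present for one candidate assignment."""
    before = [t.lower() for t in tokens[:substrate_idx]]
    between = [t.lower() for t in tokens[substrate_idx + 1:product_idx]]
    score = 0.0
    if any(t in REACTION_WORDS for t in before):
        score += WEIGHTS["reaction_word_before_substrate"]
    if any(t in PRODUCTION_WORDS for t in between):
        score += WEIGHTS["production_word_between"]
    if "to" in between:
        score += WEIGHTS["preposition_to_between"]
    return score


def best_assignment(tokens, molecule_indices):
    """Return (substrate_idx, product_idx, score) for the best pair, or None."""
    best = None
    for s in molecule_indices:
        for p in molecule_indices:
            if s < p:  # this pattern expects the substrate to come first
                score = score_substrate_product(tokens, s, p)
                if best is None or score > best[2]:
                    best = (s, p, score)
    if best is not None and best[2] > THRESHOLD:
        return best
    return None


if __name__ == "__main__":
    tokens = "The conversion of glucose to pyruvate".split()
    print(best_assignment(tokens, [3, 5]))
```

A fuller version would score every pattern (including those with an enzyme) for a sentence and keep the single highest-scoring assignment across patterns.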

5.3 Forming Networks from Individual Metabolic Reactions

The previous stage results in a list of putative individual metabolic
reactions. Before a network can be formed, reactions must first be
assigned to the correct host organism (see Note 7). We developed a
simple heuristic method, similar to entries in the gene
normalization task of BioCreative III [33]. Based on a small
development corpus of 30 documents, we propose the following simple
rules in order of priority:

1. If an organism is mentioned within a reaction sentence, the 
reaction is associated with this organism. If multiple organisms 
are mentioned, the reaction is assigned to all. 

2. If the previous point does not apply, the reaction described will 
belong to the first organism mentioned in the paper. 
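
The two rules above translate almost directly into code. In this sketch the function name is hypothetical, and both arguments are assumed to be lists of organism identifiers produced by an organism NER step, with the document list ordered by first mention.

```python
def assign_organisms(sentence_organisms, document_organisms):
    """Assign a reaction to host organism(s) using the two priority rules.

    sentence_organisms: organisms mentioned in the reaction sentence.
    document_organisms: organisms in order of first mention in the paper.
    """
    if sentence_organisms:
        # Rule 1: every organism named in the sentence, duplicates removed
        # while preserving order.
        return list(dict.fromkeys(sentence_organisms))
    if document_organisms:
        # Rule 2: fall back to the first organism mentioned in the paper.
        return [document_organisms[0]]
    # No organism found anywhere: leave the reaction unassigned.
    return []


if __name__ == "__main__":
    doc = ["Mycobacterium tuberculosis", "Escherichia coli"]
    print(assign_organisms(["Escherichia coli"], doc))
    print(assign_organisms([], doc))
```

Returning an empty list when no organism is found is a choice made for this sketch; the text does not say how such reactions are handled.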

Multiple individual extractions may describe the same metabolic
reaction, but be extracted from separate sources (or from separate
sentences within the same source). Working out that multiple
extractions refer to the same reaction is not trivial, however, as
different sources may use different names for the metabolites and
not all metabolites may be included (particularly side-metabolites,
such as ATP and ADP).

Solving both of these problems requires linking metabolite
names to common identifiers (such as SMILES and InChI).
Fortunately, there are a number of databases that catalogue small
molecule synonyms, the largest of which, PubChem [54], has
catalogued approximately 50 million compounds. As a great many
reactions may be extracted by the algorithm, using web services to
retrieve the InChI for a given small molecule would be unsuitable.
Rather, a local database should be utilized. Here we describe the
use of ChEBI [55], a smaller database than PubChem that is more
focused on metabolic pathways.

1. The entire ChEBI database can be downloaded as a series of
tab-delimited files. The three files containing data necessary for
this task are compounds.tsv, names.tsv, and inchis.tsv.
These files can be simply read into an SQL database (such as
MySQL) and the synonyms indexed. More complex searching
is required, however, as names used by authors may not
correspond exactly to those held in the database (for instance,
punctuation is often variable in small molecule names).

2. While complex querying can be provided through the
implementation of a full search index (see Lucene for a standard Java
solution), a simpler solution is to pregenerate variants that
disregard elements that are likely to be variable in small
molecule names. Consider the small molecule aldehydo-D-glucose
6-phosphate(2-). The following variants can be generated:

(a) Remove round and square brackets with their content.
aldehydo-D-glucose 6-phosphate

(b) Remove stereochemistry identifiers.
aldehydo-glucose 6-phosphate

(c) Remove all whitespace.
aldehydo-glucose6-phosphate

(d) Remove non-word characters (the set of word characters
contains the 26 letters, 10 digits, and underscore).
aldehydoglucose6phosphate

(e) Remove any non-letters.
aldehydoglucosephosphate

3. Putative small molecules extracted by OSCAR4 undergo the
same variant generation and the variants (a) to (e) are queried
until a match is found. If a match is found, the putative
metabolite is assigned the corresponding InChI.

4. Determining whether two extracted reactions are referring to
the same reaction is not trivial even once InChIs have been
assigned to all the metabolites. While two extractions
containing the exact same metabolites can safely be merged, consider
the following two reactions:

A → C

A + B → C

There are two different ways of merging the reactions:

• Taking the union (creating a reaction including all
metabolites from all the merged reactions): this is the correct
approach when B is a correct extraction, but is a side
metabolite (such as ATP) which is not always included.

• Taking the intersection (creating a reaction including only
those metabolites found in all the merged reactions): this is
the correct approach if B is an incorrect extraction.

5. We have evaluated both methods, but found weaknesses with
both. Ultimately we decided to only merge reactions that were
exactly the same, but other strategies may be viable.

6. While joining reactions together to form pathways is usually
trivial (i.e., if the product of one reaction has the same InChI as
a substrate of a different reaction, the reactions can be joined),
currency molecules can be problematic. Currency molecules,
such as ATP, tend to form a small number of highly connected
nodes which the rest of the network clusters around (due to
their involvement in many unrelated reactions). We use a
manually curated list of currency molecules and treat each
mention of a particular currency molecule as a unique entity
(see Note 8). Therefore, completely separate reactions that
both happen to convert ATP to ADP will not be linked
together. There are problems with using a static list to identify
currency molecules, however, as a metabolite’s status as
currency or non-currency can depend on context. While
acetyl-coenzyme A is often confined to side reactions, it is an
integral metabolite in the TCA cycle, and pathways downloaded from
BioCyc using the API make no distinction between “currency”
and “integral” metabolites.
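
The variant generation of steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the stereochemistry pattern is a hypothetical, incomplete list of prefixes (it would, for example, also strip the locant in names such as S-adenosyl compounds), matching is made case-insensitive as an extra simplification, and an in-memory dictionary with placeholder identifiers stands in for the local ChEBI database.

```python
import re

# Hypothetical, incomplete pattern for stereochemistry prefixes such as "D-".
STEREO = re.compile(r"(?<![A-Za-z])(?:D|L|R|S|E|Z|cis|trans)-")


def name_variants(name):
    """Return the cumulative variants (a) to (e) of a small molecule name."""
    a = re.sub(r"\([^()]*\)|\[[^\[\]]*\]", "", name).strip()  # drop brackets
    b = STEREO.sub("", a)                                     # drop stereo prefixes
    c = re.sub(r"\s+", "", b)                                 # drop whitespace
    d = re.sub(r"\W", "", c)                                  # keep word characters
    e = re.sub(r"[^A-Za-z]", "", d)                           # keep letters only
    return [a, b, c, d, e]


def build_index(names_to_ids):
    """Pregenerate variants for every database name, keyed by variant level."""
    index = {}
    for name, identifier in names_to_ids.items():
        for level, variant in enumerate(name_variants(name)):
            # setdefault keeps the first identifier if two names collide.
            index.setdefault((level, variant.lower()), identifier)
    return index


def lookup(name, index):
    """Query variants (a) to (e) in order until one matches."""
    for level, variant in enumerate(name_variants(name)):
        identifier = index.get((level, variant.lower()))
        if identifier is not None:
            return identifier
    return None


if __name__ == "__main__":
    # "PLACEHOLDER-1" stands in for a real InChI string.
    index = build_index({"aldehydo-D-glucose 6-phosphate(2-)": "PLACEHOLDER-1"})
    print(name_variants("aldehydo-D-glucose 6-phosphate(2-)"))
    print(lookup("aldehydo-glucose-6-phosphate", index))
```

Querying the strictest variant first means a looser variant is only used when nothing more specific matches, which limits the number of false matches that the letters-only variant would otherwise introduce.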

5.4 Assessing Reaction Relevance

A network created by following the previous steps will typically be
very large, containing hundreds (and often thousands) of individual
metabolic reactions. The extracted reactions can be classified into
three categories:

• Extractions that do not match the content of the source 
sentence—either a described reaction is extracted incorrectly or 
the sentence does not even describe a reaction. 

• Extractions that accurately represent a described reaction, but 
are irrelevant to the user. 

• Correct extractions that are relevant to the user. 

Unfortunately the third category is invariably in the minority.
While there are a number of possible methods for determining
whether an extraction accurately represents a real reaction (for
example, counting the number of times the reaction is extracted,
or finding the reaction in a database of known reactions such as
BRENDA; see Note 9), here we will focus on possible methods for
determining whether a given reaction is relevant to the user.

First, if the reaction matches a reaction in the seed pathway
submitted by the user to construct the PubMed queries, the
reaction will typically be relevant to the user. We have found,
however, that a reaction containing just a single metabolite found in
the seed pathway will typically not be relevant, particularly with
metabolites that are found in many pathways.

MetaCyc groups pathways together that have the same purpose
(such as the biosynthesis or degradation of a specific metabolite),
but in different contexts. Such pathways are distinguished by their
titles ending with Roman numerals (for instance, “glycolysis I” and
“glycolysis II” describe glycolysis from two different starting
metabolites). Pathways containing an extracted reaction can be found
and their names compared to those of the seed pathway(s). If a match
is found, the extracted reaction is likely to be relevant.

These methods, however, rely on the extracted reaction already 
being known, though not necessarily in the same organism, and 
being present in MetaCyc—the reactions in a novel alternative 
pathway will not be found as relevant. We have identified two 
general properties that correlate with relevance, however, that do 
not require prior knowledge: 

• The number of times a reaction is extracted. 

• The similarity of the set of metabolites in the seed pathway and 
the set of metabolites mentioned in a source document as 
measured using Jaccard Index (see Note 10). 
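
The second property is straightforward to compute once metabolites have been mapped to common identifiers. A minimal sketch, assuming both inputs are collections of such identifiers (returning 0.0 when both sets are empty is a convention chosen here, not something stated in the text):

```python
def jaccard_index(seed_metabolites, document_metabolites):
    """Size of the intersection divided by the size of the union."""
    seed, doc = set(seed_metabolites), set(document_metabolites)
    union = seed | doc
    if not union:
        return 0.0  # convention for two empty sets (an assumption)
    return len(seed & doc) / len(union)


if __name__ == "__main__":
    seed = {"glucose", "pyruvate", "lactate"}
    document = {"pyruvate", "lactate", "citrate"}
    print(jaccard_index(seed, document))  # 2 shared / 4 in total
```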

Depending on the task at hand, however, more properties may
be identified. For instance, if links between pathways are of interest,
reactions in an unbroken route connecting a metabolite in the seed
pathway to that of a pathway of interest could be marked as
relevant. Here we consider the identification of alternative pathways
where one pathway takes a different route between two
metabolites. The following steps describe how to correctly weight
these properties using a training set:

1. Identify pairs of pathways in MetaCyc from two organisms
which show different routes between the same two
metabolites. Separate a small number out to use as a training set
while the others will form a test set.

2. Run the reaction extraction algorithm and network building
method using one pathway in the pair as a seed pathway, but
specifying the organism of interest as the host of the other
pathway.

3. Using the other pathway in the pair as gold standard relevant
reactions, calculate the probabilities of a reaction being relevant
with regard to the number of times it has been extracted. For
instance, if from 20 reactions that have been extracted 3 times,
5 are found in the gold standard relevance pathway, reactions
extracted 3 times have a 0.25 probability of being relevant.

4. Calculate the similarity between the sets of metabolites in the
seed pathway and in each source document using Jaccard
Index. Sort all extracted reactions by this score (if a reaction is
extracted from multiple sources, simply take the greatest
Jaccard Index of any of the sources). Calculate the precision of
relevant reactions (i.e., reactions found in the gold standard
relevance pathway) in a moving window at each reaction in the
sorted list. Plot the Jaccard Index assigned to each reaction
against the precision of relevant items and calculate a curve of
best fit. The equation of the curve will be used to calculate a
probability of relevance using a given Jaccard Index.

5. Find branches connecting metabolites from the seed pathway.
Starting with such a metabolite, identify each reaction
containing the metabolite as a substrate. Then do the same with the
products of these identified reactions and so on until another
metabolite from the seed pathway is discovered, the route
loops back on itself, or the route simply ends (see Note 11).

6. As the pairs of metabolic pathways in the training set only have
one alternative branch each, it is not possible to calculate a
meaningful probability for whether a given branch in the
extracted pathway is relevant. Branch relevance is instead
calculated from the length of the branch (a route containing only
two reactions connecting two metabolites found in the seed
pathway is more likely to be relevant than a route containing
six reactions) and the correctness of each individual reaction
(a single incorrectly extracted reaction should lower the
relevance of the entire branch). The product of the individual
correctness probabilities can be calculated to take into account
both of these factors.
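
The calculations in steps 3, 4, and 6 can be sketched as below. The function names, the window size, and the choice to centre the window on each reaction are assumptions made for illustration; curve fitting is omitted, and reactions are represented by arbitrary hashable identifiers.

```python
from collections import defaultdict
from math import prod


def relevance_by_count(extraction_counts, gold_reactions):
    """Step 3: probability of relevance for each extraction count.

    extraction_counts maps a reaction identifier to the number of times
    it was extracted; gold_reactions is the gold standard set.
    """
    totals, relevant = defaultdict(int), defaultdict(int)
    for reaction, count in extraction_counts.items():
        totals[count] += 1
        if reaction in gold_reactions:
            relevant[count] += 1
    return {count: relevant[count] / totals[count] for count in totals}


def moving_window_precision(sorted_reactions, gold_reactions, window=5):
    """Step 4: precision of gold reactions in a window around each position."""
    half = window // 2
    precisions = []
    for i in range(len(sorted_reactions)):
        chunk = sorted_reactions[max(0, i - half): i + half + 1]
        precisions.append(sum(r in gold_reactions for r in chunk) / len(chunk))
    return precisions


def branch_relevance(correctness_probabilities):
    """Step 6: product of the correctness probabilities along a branch."""
    return prod(correctness_probabilities)


if __name__ == "__main__":
    counts = {f"r{i}": 3 for i in range(20)}
    gold = {"r0", "r1", "r2", "r3", "r4"}
    print(relevance_by_count(counts, gold))  # 5 of 20 reactions are relevant
    print(branch_relevance([0.9, 0.9]), branch_relevance([0.9] * 6))
```

Because every probability is at most 1, the product falls as a branch gets longer and drops sharply when any single reaction is unlikely to be correct, which captures both factors described in step 6.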

5.5 Outputting the Extracted Pathway

There are a number of different use cases that demand the extracted
pathway in different formats. For a developer implementing the
tool as a library, the extracted pathway should be returned in a
computer readable format, such as the XML-based format SBML
(for which there are mature Java libraries). An end-user, however,
would typically prefer the pathway in a human readable format.
This could entail outputting the extracted reactions as a list
(ordered by relevance score, for instance) or as a network diagram.
While there are many network drawing libraries, outputting files
that can be drawn by a separate package, such as Cytoscape (see
Note 12), would be trivial to develop and could fit well into the
user’s workflow.
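
As one possible illustration of such an export, the sketch below writes reactions as tab-delimited lines in the style of Cytoscape's simple interaction format (SIF: source node, interaction type, target node). The choice of SIF, the function name, and the "converted_to" label are assumptions made here; the text does not state which file format the authors used.

```python
def reactions_to_sif(reactions, interaction="converted_to"):
    """Render reactions as SIF-style lines, one edge per substrate-product pair.

    reactions: iterable of (substrates, products) pairs, each a list of
    metabolite names.
    """
    lines = []
    for substrates, products in reactions:
        for substrate in substrates:
            for product in products:
                lines.append(f"{substrate}\t{interaction}\t{product}")
    return "\n".join(lines)


if __name__ == "__main__":
    network = [(["A", "B"], ["C"]), (["C"], ["D"])]
    with open("extracted_pathway.sif", "w", encoding="utf-8") as handle:
        handle.write(reactions_to_sif(network) + "\n")
```

Flattening each reaction into pairwise edges loses the grouping of substrates and products, so a format such as SBML remains the better choice when the reactions themselves need to be preserved.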


6 Concluding Remarks

While machine-learning methods typically produce the best
performance of methods applied to text-mining tasks, heuristic methods,
such as the method described here, still have an important role in
the field. In a field such as metabolic pathway extraction, where
little work has been carried out, heuristic methods can be
prototyped relatively quickly and can act as a baseline for more
sophisticated methods that follow. In addition, machine-learning
methods rely on the availability of large quantities of marked-up text
for training. For tasks such as NER and PPI extraction, large corpora
have been developed which have greatly aided the development of
methods. With no metabolic reaction corpus currently available,
however, machine-learning methods are not a serious option.

Despite this, we have found that the methods described here 
perform strongly—roughly in line with methods used for PPI 
extraction. Nevertheless, there are many opportunities to improve 
the method: 

• We have described the retrieval of abstracts and full-text
open-access articles from PMC. Unfortunately, full-text
non-open-access articles have traditionally been impossible to
obtain (without contravening web scraping rules), limiting the
practical use of text-mining. There are signs that this is
changing, however, with publishers such as Elsevier releasing
APIs for accessing their libraries (although this has not been
without controversy). Such APIs, or the general CrossRef API,
could be implemented in the strategy we have described.

• The assignment of reactions to a specific organism is a key step in 
the pipeline. Incorrect assignments can lead to reactions from 
the wrong organism being included in the results or to the 
hiding of correct information. The organism assignment 
method described here uses a very simple method and was tested 
on a small corpus. While the method generally performs well, 
the significant effect of any errors may warrant the development 
of a more sophisticated method. 

• The assignment of InChIs to metabolites is another key step in
the pipeline. A failure to assign an InChI to a metabolite would
prevent the reaction from being merged with other occurrences 
of the same reaction and from being linked to other reactions 
through the particular metabolite. While the variant generation 
method we have shown for the fuzzy searching of metabolite 
names performs well with anticipated variants (such as missing 
stereochemistry identifiers), other variants, such as incorrect or 
rare spellings, are not possible to foresee (see Note 13). A full 
search index, provided by a Java library such as Lucene, would 
allow more sophisticated fuzzy matching. 

• When discussing the relevance of extracted reactions, we have
focused on the identification of alternative pathways (specifically
different routes between the same two metabolites). Other
reactions may be relevant, however, depending on the user’s needs.
The user may simply want to identify evidence for an already
established pathway or they might be interested in links to other
pathways or metabolites. Consequently, it would be possible to
develop multiple relevance algorithms and allow the user to choose
the most appropriate. However, a more detailed discussion of
these options lies outside the scope of this chapter.
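
To illustrate the kind of fuzzy matching of metabolite names suggested above, the following sketch uses Python's standard `difflib` module as a simple stand-in for a full search index such as Lucene. It is not the variant generation method described in this chapter; the name list, the helper `closest_metabolite`, and the cutoff of 0.8 are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical dictionary of known metabolite names
known_names = [
    "D-glucose 1,6-bisphosphate",
    "D-fructose 1,6-bisphosphate",
    "pyruvate",
]


def closest_metabolite(query, names, cutoff=0.8):
    """Return the best-matching known name, or None below the cutoff."""
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# A misspelling that rule-based variant generation would not anticipate
match = closest_metabolite("D-glucose 1,6-bisphosfate", known_names)
no_match = closest_metabolite("citrate", known_names)
```

A similarity cutoff trades recall against the risk of merging distinct molecules, so any such threshold would need to be tuned on real data.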


7 Notes 


1. The quantitative assessment of text-mining systems tends to 
involve the use of corpora—collections of documents with 
entities and relationships manually annotated. The system 
being assessed is run on the text within a corpus and the results 
compared to the marked-up elements using the following 
measures: 

Precision: The proportion of extracted instances that are
correct extractions.

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \]

Recall: The proportion of relevant instances that are
correctly extracted.

\[ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \]

F1-score (often abbreviated to F-score): An overall measure
of accuracy, the harmonic mean of precision and recall.

\[ F_1\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

2. We have investigated the use of web scraping to retrieve full
articles and, while most publishers did not block access when
using a very conservative method with long delays, some
publishers blocked us regardless.

3. While submitting the query as a GET parameter is typically 
easier to code, there is a character limit which is easy to reach 
with a small molecule with many synonyms. POST, however, 
has no such limit. 

4. If the full-text of the article is held in PubMed Central, this 
XML document will contain the PMC ID. This does not 
guarantee that the article is open-access and can be retrieved 
using the E-Utils API, however. 

5. A similar query can be used to obtain articles from PMC if you
wish by changing db=pubmed to db=pmc. If the article is
open-access the returned XML document will contain the full-text
article.

6. A reaction word typically refers to the substrate and describes
the reaction, either specifically (e.g., hydrolysis) or generally
(e.g., converts). A list of specific reaction words was inferred
from the naming of enzymes in the Enzyme Classification (for
instance, alcohol dehydrogenase leads to the reaction word
dehydrogenates) while the general reaction words were simply
compiled from example text. Production words (e.g., forms,
produces) refer to the product. The word lists, in fact, hold
the word stems so that all variants need not be included.
Stemming was performed using a Java implementation [56]
of the standard Porter stemming algorithm [57]. See [53] for
the full list of words.

7. While the literature search strategy should retrieve articles
relevant to a specific organism, it remains necessary to assign
individual reactions to an organism as articles rarely mention
just a single organism. The organism of interest may be
mentioned in passing in an article abstract while the article deals
principally with a different organism. Reactions in such an
article should not be assigned to the organism of interest. An
article dealing with the organism of interest may compare
against reactions in other organisms. Such reactions should be
recognized as not belonging to the organism of interest.

8. The following molecules are recognized (by their InChI) as
currency molecules: NAD+, NADH, NADP+, NADPH, ATP,
ADP, AMP, C, O, N, H+, CO2, and H2O.

9. To develop a correctness measure, extract a number of pathways
using the algorithm and calculate the probability of a reaction
being correct given one or more features. For instance,
consider that the number of times a reaction is extracted is
identified as the sole feature relevant to reaction correctness.
If 60 % of the reactions extracted once in the training set were
correct extractions, reactions extracted once would achieve a
correctness score of 0.6.

10. The Jaccard Index measuring the similarity of two sets is
calculated by dividing the number of features common to both sets
by the total number of distinct features across the two sets. For
instance, if a seed pathway contains 10 metabolites and an article
mentions 5 of these metabolites in addition to a further 20 small
molecules not found in the seed pathway, the Jaccard Index would
be calculated as follows:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{5}{10 + 25 - 5} = \frac{5}{30} = 0.167 \]


11. While this process can continue until these conditions are met,
we have found it useful to include a branch length threshold as
longer branches are more computationally expensive to
calculate and are typically not significant. We used a maximum
branch length of 6.

12. While Cytoscape can draw SBML files directly, it cannot read 
custom annotations where relevance scores would be stored. 
Reactions should instead be output to CSV files (one
containing entity attributes and another containing relationships
between entities) which can be used to read in arbitrary 
attributes. 

13. For instance, consider the small molecule D-glucose
1,6-bisphosphate. The search strategy would identify
glucose-bisphosphate as the same molecule, but not D-glucose
1,6-bisphosfate, despite the latter’s closer spelling overall.
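
The evaluation measures of Note 1 and the Jaccard Index of Note 10 can be sketched in a few lines. The function names and the true/false positive counts below are illustrative assumptions; the set sizes reproduce the worked example in Note 10.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Assumes at least one extracted and one relevant instance,
    so that no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Note 10 example: 10 seed metabolites; the article mentions 5 of them
# plus 20 small molecules that are not in the seed pathway.
seed = {f"seed_{i}" for i in range(10)}
article = {f"seed_{i}" for i in range(5)} | {f"other_{i}" for i in range(20)}
similarity = jaccard_index(seed, article)  # 5 / 30

# Hypothetical counts: 8 correct extractions, 2 spurious, 8 missed
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```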

Chapter 9


Analysis of Genome-Wide Association Data 

Allan F. McRae 


Abstract 

The last decade has seen substantial advances in the understanding of the genetics of complex traits and 
disease. This has been largely driven by genome-wide association studies (GWAS), which have identified 
thousands of genetic loci associated with these traits and disease. This chapter provides a guide on how to 
perform GWAS on both binary (case-control) and quantitative traits. As poor data quality, through both 
genotyping failures and unobserved population structure, is a major cause of false-positive genetic
associations, there is a particular focus on the crucial steps required to prepare the SNP data prior to analysis. This
is followed by the methods used to perform the actual GWAS and visualization of the results. 

Key words Genome-wide association, SNP cleaning, Population stratification, Imputation,
Case-control, Quantitative trait


1 Introduction 


Unlike single gene disorders (such as cystic fibrosis or Huntington’s
disease), complex traits (e.g., height, body mass index) and diseases
(e.g., schizophrenia, type 2 diabetes) are the result of the
combination of many genes and environmental factors, with each gene variant
individually affecting an individual’s trait or disease risk by a small 
amount [1]. While we had been successful at the elucidation of the 
genes and genetic variants underlying single gene disorders, no genes 
underlying complex traits had been identified prior to the advent of 
genome-wide association studies (GWAS) [2]. 

GWAS test the association of hundreds of thousands or
millions of single-nucleotide polymorphisms (SNPs) across the 
genome with complex traits and diseases. While the first GWAS 
occurred earlier [3,4], the first large GWAS using SNPs that cover a 
large percentage of the genome came from the Wellcome Trust 
Case-Control Consortium (WTCCC) in 2007 [5]. In the 5 years 
following that, over 2000 loci that were significantly and robustly 
associated with complex traits were identified [6], with the number 
of significant SNP-trait associations surpassing 14,000 in 2013 [2]. 

The analysis of GWAS data has a number of statistical
challenges and potential pitfalls. In particular, false-positive associations
can be generated through population stratification and poor quality 
genotyping [7]. This chapter will discuss stringent preparation of 
data for GWAS to avoid these pitfalls and the subsequent analysis 
and visualization of the results. 


2 Data Quality Control and Cleaning 

The most important and time-consuming task in a GWAS is the 
preparation of the data. As the final analysis will be testing the 
association of hundreds of thousands, or millions, of SNPs with a 
phenotype, even small systematic biases can lead to large numbers 
of false-positive, and potentially false-negative, findings [8].
The cleaning process is generally divided into two steps: first 
removing any individuals with poor quality data and then removing 
SNP markers that have substandard genotyping performance. 
Performing the per-individual steps first prevents individuals with 
poor quality genotypes having an undue influence on the removal 
of SNP markers in the later step. 

2.1 Per-Individual Quality Control

Quality control at an individual level aims to remove samples that
have issues affecting their genotypes throughout the genome. This
will be divided into five steps: (1) removal of individuals with excess
missing genotypes, (2) removal of individuals with outlying
homozygosity values, (3) removal of samples showing a discordant sex,
(4) removal of related or duplicate samples, and (5) removal of
ancestry outliers. As the removal of an individual from the analysis is
costly, both in terms of the cost of the genotyping and the time
spent preparing the DNA sample, it is important to spend time
during the initial study design to ensure to the extent possible that
all individuals are from a common ancestral background and that
extracted DNA is of high quality.

1. Removal of individuals with excess missing genotypes: Modern
genotyping arrays call the genotype at a SNP by comparing the
intensity of fluorescence of the two alleles [9, 10]. Clusters are
formed for individuals with high intensity for the A allele and
low intensity for the B allele, for individuals with high intensity
for the B allele and low intensity for the A allele, and for those
with intermediate intensities for both alleles. Individuals falling
in these clusters are given the genotypes AA, BB, and AB,
respectively (Fig. 1). Any individual falling outside these
genotype clusters is given a missing genotype. Large numbers of
missing SNP calls for an individual indicate that the genotype is
failing to fall into any of these clusters, which can be caused by
low quality or concentration of the DNA used for genotyping.

Fig. 1 Example of SNP genotype clustering. Three distinct classes of SNPs can be seen corresponding to the
AA, AB, and BB genotypes. A few individuals (colored gray) fall outside of these clusters and are unable to be
assigned a genotype


Samples with a high missingness rate also tend to have higher 
genotyping error in the genotypes that are called, so need to be 
removed completely from analysis. Typically, a threshold in the 
order of 5 % missingness is used to determine which samples 
need to be removed; however, an appropriate threshold should 
be determined for each experiment by looking at the
distribution of missingness across samples and removing the outliers.
This step is particularly important when using a case-control
design, especially when the DNA extraction was performed
separately for cases and controls, as differential genotype
quality may correlate with disease status and thus introduce a bias to
the analysis [7]. 

2. Removal of individuals with outlying homozygosity values:
The proportion of homozygous (or inversely heterozygous)
genotypes across an individual’s genome (excluding sex
chromosomes) can detect several issues with genotyping. Average
heterozygosity correlates with genotype missingness such that
samples with high missingness tend to have lower average
heterozygosity, although a reduction in heterozygosity can
also reflect inbreeding. Sample contamination, where multiple
samples are accidentally genotyped on a single array, results in
high average heterozygosity. The average value of the
proportion of heterozygous genotypes will vary across populations
and genotyping platforms and as such the high and low
thresholds for sample removal need to be determined by examining
the distribution in your cohort.

3. Removal of samples showing a discordant sex: Determining
whether an individual is male or female is straightforward
from genotyping array data. Because males only have a single
copy of the X chromosome, they cannot be heterozygous for
any markers (outside the small pseudo-autosomal regions at
the end of the chromosome). Depending on the
genotype-calling algorithm, males may have a few heterozygous
genotypes due to the low background genotyping error rate,
although some platforms consider these as missing values.
Starting with females, individuals with low heterozygosity
across the X chromosome are indicative of a sample mix-up
with a male. For males, samples with high heterozygosity, or
excess missingness, are likely to be females. Provided males
and females have been randomly placed on plates for
genotyping, patterns of mismatching sex can be used to rectify potential
plating errors.

4. Removal of related or duplicate samples: When using
population cohorts for GWAS, it is important to exclude related or
duplicated individuals from the analysis. While the case for 
removing duplicated individuals is obvious, even individuals 
with a relatively distant relationship can bias the analysis. For 
example, if we have two related cases in a case-control analysis, 
their genotypes being on average more similar to each other 
than the rest of the cohort will provide a slight bias to the 
estimate of the allele frequency in cases and its associated 
standard error [8]. Even this small bias is important when 
considering the number of statistical tests being performed. 
Duplicate and related individuals are detected using the
Identity-by-State (IBS) metric, which measures the average
proportion of alleles shared by two individuals across the autosomal
genome. The IBS measure can be converted to a measure of the 
degree of relatedness between a pair called Identity-by-Descent 
(IBD). IBD is interpreted as the proportion of the genome that 
is shared between two individuals. For duplicate samples, or 
monozygotic twins, we expect that IBD = 1, IBD = 0.5 for 
first-degree relatives and 0.25 for second-degree relatives. A 
maximal threshold of IBD = 0.1875 (which is halfway 
between that expected of second- and third-degree relatives) 
is common [8], although selecting a smaller threshold that 
removes outlying pairs from your cohort is warranted. If any 
pair of individuals is found to be related, it is best to remove the 
one with the lowest genotyping rate as determined earlier. 

2.2 Per-Marker Quality Control


5. Removal of ancestry outliers: Population stratification is a
major source of bias in GWAS, as it is common for disease or
quantitative traits to have different frequencies or distributions
across populations [11]. This is a particularly important
consideration for case-control studies although quantitative trait
studies are also affected. For example, Campbell et al. [12]
performed an association analysis on two groups of individuals
of European descent that were discordant for height and
identified an association with the LCT locus. This locus has
historically undergone strong selection in certain European
populations, resulting in the frequency of its variants differing
significantly across the populations who also differed in average
height. While this may be considered an extreme example, even
small amounts of population stratification in an apparently
homogeneous population can bias association results.

The most common method used to detect ancestral outliers
is principal component analysis (PCA), although other
methods such as multidimensional scaling can equally be used.
Principal component analysis is a multivariate statistical method
that produces a set of uncorrelated variables (principal
components) such that the first principal component explains the most
variation in the data, followed by the second, and so on. Using
a reference population with large ancestry differences ensures
(at least) the first two principal components reflect major
ancestral differences, with a commonly used reference being the
HapMap data [13] that includes populations from Africa
(YRI), Asia (CHB + JPT), and Europe (CEU). When
calculating principal components it is important to first filter the
SNPs into an approximately independent subset of roughly
50,000 SNPs to avoid undue influence of regions with
high linkage disequilibrium.

Once the principal components are generated from the
reference populations, the PCA model is applied to the study
cohort to calculate their principal component values, allowing 
them to be clustered alongside the HapMap individuals 
(Fig. 2). Any outlying individuals should be removed from 
the analysis, with a typical threshold being any individuals 
that are further than four standard deviations away from the 
cluster mean. 
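
Three of the per-individual checks above can be sketched as follows: the missingness filter, the choice of which member of a related pair to drop, and the four-standard-deviation ancestry filter. The genotype encoding (0/1/2 with `None` for a missing call), the function names, and the toy values are assumptions for illustration. For simplicity the ancestry check uses a single principal component and a single reference cluster, and the IBD estimates are taken as given.

```python
from statistics import mean, stdev

MISSING = None  # assumed encoding of a failed genotype call


def missing_rate(genotypes):
    """Fraction of SNPs with no genotype call for one individual."""
    return sum(g is MISSING for g in genotypes) / len(genotypes)


def individual_to_drop(id_a, id_b, ibd, call_rates, max_ibd=0.1875):
    """For a related pair, return the ID with the lower call rate."""
    if ibd <= max_ibd:
        return None  # pair is not treated as related
    return id_a if call_rates[id_a] < call_rates[id_b] else id_b


def ancestry_outliers(study_pc, reference_pc, n_sd=4.0):
    """Indices of study individuals far from the reference cluster mean."""
    centre, spread = mean(reference_pc), stdev(reference_pc)
    return [i for i, v in enumerate(study_pc)
            if abs(v - centre) > n_sd * spread]


# Toy data: 0/1/2 count the B allele, None marks a missing call
good_sample = [0, 1, 2, 1] * 25            # 100 SNPs, none missing
poor_sample = [0, 1, None, 2, None] * 20   # 40 % missing
flagged = [missing_rate(s) > 0.05 for s in (good_sample, poor_sample)]

call_rates = {"S1": 0.99, "S2": 0.97}
drop = individual_to_drop("S1", "S2", ibd=0.5, call_rates=call_rates)

reference_pc1 = [0.00, 0.01, -0.01, 0.02, -0.02]
study_pc1 = [0.01, 0.03, 0.20]
outliers = ancestry_outliers(study_pc1, reference_pc1)
```

In practice the thresholds should be chosen after inspecting the distributions in the cohort, as described in the steps above.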

Fig. 2 Principal component analysis to detect ancestral outliers in a study population. The four HapMap
populations representing the three major ancestral groups are used as anchors to analyze the study population
(black crosses). In this example, the majority of the study population is of primarily European ancestry and is
clustered around the CEU population. A few individuals show admixture with a proportion of Asian ancestry
and need to be removed before performing the GWAS analysis

The second stage of genotype cleaning involves looking at
individual SNPs to determine genotype accuracy. As discussed earlier,
genotypes at each SNP are typically called by clustering individuals
into three clusters representing AA, AB, and BB genotypes. Ideally
each individual cluster plot would be investigated to confirm the
separation of clusters and thus accuracy of genotype calling.
However, this is impractical with current genotyping arrays that have
hundreds of thousands of SNPs. Instead, statistical metrics of SNP
quality are used to determine whether to remove them from the
analysis. While every SNP removed is potentially a missed
association, an attempt will be made to recover them using imputation in
the next section. This stage will be divided into four steps: (1)
removal of SNPs with excess missing genotypes, (2) removal of
SNPs that deviate from Hardy-Weinberg equilibrium, (3) removal
of SNPs with low minor allele frequency, and (4) comparing minor
allele frequency to known values.

1. Removal of SNPs with excess missing genotypes: Poor
separation of clusters during genotype calling will result in an increase
in the number of individuals falling into a region that is not 
clearly in one genotype cluster or another and thus having their 
genotype called as missing. A SNP should be excluded from 
further analysis if it has greater than 5 % missing genotypes, 
although more stringent thresholds have also been applied. 

When analyzing case-control association studies, a
secondary check of the missing data rates is required. As missingness
can be nonrandom with respect to the underlying genotype,
differential missing genotype rates between cases and controls
can lead to false-positive results. Any SNP showing a difference
in missingness with $p < 10^{-6}$ should be removed from further
analysis. This step is particularly important if the case and
control cohorts have been collected at different times as there
may be a difference in DNA quality between the cohorts.

2. Removal of SNPs that deviate from Hardy-Weinberg
equilibrium: The principle of Hardy-Weinberg equilibrium (HWE)
provides us with expected genotype frequencies given an allele
frequency. Poor genotype calling, including issues due to poor
cluster separation, can result in deviations from genotype
frequencies expected under HWE [14]. Typically a threshold of
$p < 10^{-6}$ is used to exclude SNPs from further consideration.
One caveat is that selection can also cause deviations from
HWE, and thus exclusion due to this could result in missing
important disease associations [15]. Therefore, in case-control
studies, only the controls should be used to screen for
deviations from Hardy-Weinberg.

3. Removal of SNPs with low minor allele frequency: Typically, 
any SNPs with a minor allele frequency less than 1 % are 
removed from the analysis. SNPs at this frequency have very 
few individuals with heterozygous genotype or the rare
homozygous genotype and thus it is difficult to determine accurate
genotype clusters when calling genotypes. Also, there is low 
power to detect associations for SNPs at low frequencies and 
alternative strategies for association testing combining multiple 
variants should be used [16]. 

4. Comparing minor allele frequency to known values: A final
check of genotype quality is to compare the minor allele
frequency of the called genotypes to highly genotyped cohorts
such as the HapMap [13] and 1000 Genomes samples [17].
While there will be differences in allele frequency between the
two cohorts due to ancestry differences and sampling variation,
an overall similarity in allele frequencies indicates that the
annotation of the SNPs being analyzed is correct, while individual
deviations are a sign of genotyping error.
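
The first three per-marker filters can be sketched from genotype counts alone. The function names are hypothetical, and the Hardy-Weinberg check uses a simple chi-square goodness-of-fit test with one degree of freedom as a stand-in; the chapter does not specify the test, and exact tests are often preferred in practice. In a case-control study the HWE counts would be taken from controls only.

```python
from math import erfc, sqrt


def snp_summary(n_aa, n_ab, n_bb, n_missing):
    """Return (missing rate, minor allele frequency) from genotype counts."""
    called = n_aa + n_ab + n_bb
    missing = n_missing / (called + n_missing)
    freq_b = (2 * n_bb + n_ab) / (2 * called)
    return missing, min(freq_b, 1.0 - freq_b)


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from HWE proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the A allele
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df


def passes_marker_qc(n_aa, n_ab, n_bb, n_missing,
                     max_missing=0.05, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the missingness, MAF and HWE thresholds quoted in the text."""
    missing, maf = snp_summary(n_aa, n_ab, n_bb, n_missing)
    return (missing <= max_missing and maf >= min_maf
            and hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p)


# Toy genotype counts (AA, AB, BB)
in_hwe = passes_marker_qc(360, 480, 160, n_missing=20)
no_hets = passes_marker_qc(500, 0, 500, n_missing=0)
rare = passes_marker_qc(995, 5, 0, n_missing=0)
```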


3 Imputation 


Genotype imputation is the process of predicting, or imputing,
genotypes that are not known in a sample of individuals. This uses
a reference panel of densely genotyped individuals to impute SNP
genotypes for a study set that has only been genotyped at a subset
of those SNPs. Predicting these unobserved SNPs increases the
number of SNPs that can be tested for association, which in turn
increases the power of the study and the ability to fine-map the
causal variants [18]. It also facilitates a meta-analysis of multiple
cohorts by ensuring that all cohorts have a common set of SNPs.

Genotype imputation is generally performed in two steps. First,
the samples being imputed are phased; that is, their genotypes are
separated into their two component haplotypes. While this step is
not strictly necessary for some imputation software, performing the
haplotyping step separately can improve overall computational
performance. The second step is to fill in missing genotypes in the
target sample using a reference panel of haplotypes with a dense set
of SNPs. The basic principle relies on finding sections of a haplotype
in the reference population that share SNP alleles with the
haplotype being imputed, which indicates that region of the genome is
descended from a common ancestor and thus carries the same SNP
alleles across the whole haplotype.

Clearly, the larger the reference panel of haplotypes, the greater 
chance of finding a matching haplotype and the more accurate the 
imputation will be. Initially the HapMap populations were used as a 
reference for imputing, but larger reference samples are
continuously becoming available, including the 1000 Genomes individuals.
Even when imputing a sample from a single ancestral background, 
the imputation accuracy is improved by using a reference panel 
covering many ancestral backgrounds [19]. 

After imputation, each SNP is given a set of probabilities of
having the three possible genotypes. A number of measures have
been proposed to describe the accuracy of imputation at a given
SNP, although there is a strong correlation between them.
These statistics provide an $R^2$ measure that lies within the range
0-1, with a value of 1 indicating there is no uncertainty in the
imputed genotypes and 0 indicating that imputation provided no
information on that SNP’s genotypes. A SNP with imputation $R^2$
in a sample of N individuals has power equivalent to having
$N \times R^2$ actually genotyped individuals [18]. Any SNPs with poor
imputation accuracy are typically filtered out before any association
analysis. The threshold used depends on whether the genotype
probabilities are being used in the analysis, or whether a “best-guess”
genotype will be used by selecting the genotype with the highest
probability. When using genotype probabilities, a low threshold of
$R^2 > 0.3$ can be used, although there is no bias introduced by
including all SNPs. A much more stringent threshold of at least
$R^2 > 0.8$ should be used when considering best-guess genotypes.
If using best-guess genotypes, it is important to repeat the filtering
for Hardy-Weinberg equilibrium after the imputation.
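
The post-imputation filtering described above can be sketched as follows. The function names and the example probabilities are illustrative assumptions; the $R^2$ value itself is assumed to be supplied by the imputation software.

```python
def best_guess(probs):
    """Return the most probable genotype from (P_AA, P_AB, P_BB)."""
    labels = ("AA", "AB", "BB")
    return labels[max(range(3), key=lambda i: probs[i])]


def effective_sample_size(n_individuals, r2):
    """Equivalent number of directly genotyped individuals, N x R^2."""
    return n_individuals * r2


def keep_snp(r2, use_best_guess):
    """Apply the R^2 thresholds quoted in the text (0.8 or 0.3)."""
    threshold = 0.8 if use_best_guess else 0.3
    return r2 > threshold


call = best_guess((0.1, 0.7, 0.2))
n_eff = effective_sample_size(5000, 0.8)
keep_for_probabilities = keep_snp(0.5, use_best_guess=False)
keep_for_best_guess = keep_snp(0.5, use_best_guess=True)
```

The same SNP can therefore be retained in a probability-based analysis but excluded from a best-guess analysis.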


4 Association Testing 

After preparing the genetic data, performing the actual association
analysis is relatively straightforward. At each SNP in the genome, a
simple statistical test is performed to assess the association between
the SNP and trait of interest. The analysis of quantitative traits and
disease (or binary) traits will be considered separately.

4.1 Quantitative Traits

4.2 Disease Traits

Genetic association analysis for quantitative traits is performed
using a linear regression. Typically a simple model is considered,
where each SNP is encoded 0, 1, or 2 representing the number of B
alleles in the genotypes AA, AB, and BB, respectively. For imputed
SNPs the equivalent value is calculated by taking $2 \times P_{BB} + P_{AB}$,
where $P_{BB}$ and $P_{AB}$ are the probabilities of genotypes BB and AB,
respectively. This is referred to as the additive model of association
as each copy of the B allele is represented as adding to the trait
value. While it is possible to test more complex modes of genetic
inheritance, in general we do not know the mechanism of
inheritance at a SNP a priori and in this case the simple additive model is
usually used.

Genetic association at each SNP in the genome is tested by
performing a linear regression of the trait value on the SNP value.
The use of a regression framework allows the inclusion of any
known covariates that may affect the trait, such as age or sex. Due
to the large number of statistical tests being performed, it is
particularly important that the assumptions underlying the regression
model are satisfied. Of particular importance, the normality of
residuals needs to be ensured under the null hypothesis of no
association. If needed, this can be achieved by rank normalizing
the trait data, in which each trait value is replaced by the
quantile of a normal distribution corresponding to its rank. If
covariates with large effects are included in the analysis, it is
preferable to correct for these in a linear regression first and
then ensure the normality of the residuals from this model.
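
The additive test can be sketched as a single-SNP regression without covariates. The function names, the rank offset of 0.5, and the toy data are assumptions for illustration. The p-value uses a normal approximation to the t distribution, which is only reasonable for the large samples typical of GWAS.

```python
from math import erfc, sqrt
from statistics import NormalDist, mean


def dosage(p_ab, p_bb):
    """Expected number of B alleles for an imputed genotype."""
    return 2.0 * p_bb + p_ab


def rank_normalize(values):
    """Replace each value by a normal quantile according to its rank.

    Ties are broken by input order, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = normal.inv_cdf((rank - 0.5) / n)
    return out


def additive_regression(x, y):
    """Regress trait y on allele count x; return (beta, se, p)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided, normal approximation
    return beta, se, p


# Toy data: B allele counts and trait values for six individuals
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
beta, se, p_value = additive_regression(genotypes, trait)

imputed = dosage(p_ab=0.3, p_bb=0.6)
normalized = rank_normalize([3.0, 1.0, 2.0])
```

Covariates would require a multiple regression, which is omitted here to keep the sketch short.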

Genetic association tests for disease traits test whether the
proportion of B alleles at a SNP differs between cases and controls. This
tests for a multiplicative model of association, where each copy of
the B allele increases the risk of developing the disease by a factor r,
i.e., a baseline risk of b for genotype AA, a
risk of br for genotype AB, and a risk of $br^2$ for genotype BB. Again
more complex models could be applied to the data, but the true
inheritance model is generally unknown and thus multiplicative
models are used in the first instance.

Testing for association at a SNP can be done using a simple
chi-square contingency table test with a 2 x 2 matrix containing the
counts of A and B alleles for cases and controls in each row. A 
Fisher’s Exact Test could also be used, although cell counts are 
generally large enough to use the chi-square approximation given 
the filtering of rare SNPs that was performed during the data 
cleaning. For imputed data, the number of B alleles in an individual 
can be generated as earlier for quantitative traits and the total across 
all cases and controls taken. This will end up with noninteger 
4.3 Significance

4.4 Visualization of Results

number counts for each cell in the contingency table, meaning the 
Fisher’s Exact Test cannot be used. If covariates are to be included 
in the analysis, a logistic regression framework is used instead. 
When no covariates are used, a logistic regression is statistically 
equivalent to the contingency table model. 
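
The allelic contingency table test can be sketched as below. The function name and the allele counts are illustrative assumptions. No continuity correction is applied, and because only arithmetic on the cell counts is used, noninteger counts from imputed data are handled in the same way.

```python
from math import erfc, sqrt


def allelic_chi2(case_a, case_b, control_a, control_b):
    """Chi-square test (1 df) on a 2 x 2 table of allele counts."""
    table = [[case_a, case_b], [control_a, control_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Toy counts: 500 cases and 500 controls, i.e., 1000 alleles per group
chi2, p_value = allelic_chi2(case_a=600, case_b=400,
                             control_a=700, control_b=300)
chi2_null, p_null = allelic_chi2(600, 400, 600, 400)
```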

It is important to correct for the large number of tests performed in
a GWAS when assessing the significance of a result.
Correcting for the number of SNPs tested using (e.g.) a Bonferroni
correction is overly conservative due to the linkage disequilibrium
between SNPs, particularly when using imputed data. It has been
shown that a significance threshold of $5 \times 10^{-8}$ corrects for the
effective number of independent tests genome-wide [20]. A less
stringent threshold of $1 \times 10^{-5}$ is widely used to indicate
“suggestive” significance, although many results are expected to achieve
this level of significance in a typical GWAS.
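
As a worked illustration of where this threshold comes from, assume a family-wise error rate of 0.05 and roughly one million effectively independent tests across the genome. The latter is the figure usually quoted for common variants in European-ancestry samples and is an assumption here, not a value taken from this chapter. A Bonferroni correction over that effective number of tests gives:

```latex
\[ \alpha_{\text{per test}} = \frac{0.05}{10^{6}} = 5 \times 10^{-8} \]
```

Correcting instead for the raw number of imputed SNPs, which can be several million, would give a smaller and unnecessarily strict threshold.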

GWAS results are typically represented using a Manhattan plot, 
with genomic locations along the X-axis and the negative logarithm 
(base 10) of the ^-value along the T-axis, with each point signifying 
an individual SNP (Fig. 3a). The SNPs with the strongest associa¬ 
tions will have the greatest negative logarithms and will tower over 
the background of unassociated SNPs—much like skyscrapers 
in the Manhattan skyline. This plot provides an additional check 
on the quality of the association test, as multiple SNPs should be 
contributing to each peak, especially when using imputed data. A 
single outlying SNP can indicate poor quality genotyping and the 
initial clustering of the genotypes for that SNP should be inspected. 
In addition, a Manhattan plot showing significant points occurring 
across the genome should be considered suspect and the presence 
of confounder effects such as genotyping batch effects (particularly 
for case-control studies), undetected relatedness and sample dupli¬ 
cations, or population stratification will need to be reassessed. 
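
The coordinates of a Manhattan plot can be sketched in a few lines. The example below is an illustration only: the function name, the toy records, and the choice to approximate each chromosome's length by its largest observed position are assumptions made for this sketch, not part of any package mentioned in this chapter.

```python
import math

GENOME_WIDE = 5e-8   # genome-wide significance threshold quoted in the text
SUGGESTIVE = 1e-5    # "suggestive" threshold quoted in the text


def manhattan_points(results, chrom_order):
    """Turn (chromosome, position, p-value) records into (x, y) coordinates.

    x is a cumulative genomic position, y is -log10(p).
    Assumption: chromosome length is approximated by the largest
    position observed on that chromosome.
    """
    chrom_length = {}
    for chrom, bp, _ in results:
        chrom_length[chrom] = max(chrom_length.get(chrom, 0), bp)

    # Offset of each chromosome along the X-axis, in the requested order.
    offset, running = {}, 0
    for chrom in chrom_order:
        offset[chrom] = running
        running += chrom_length.get(chrom, 0)

    return [(offset[chrom] + bp, -math.log10(p)) for chrom, bp, p in results]


# Hypothetical toy records, not real data.
toy_results = [("1", 100, 0.5), ("1", 900, 1e-3), ("2", 50, 1e-9), ("2", 400, 0.02)]
points = manhattan_points(toy_results, ["1", "2"])
significant = [xy for xy in points if xy[1] > -math.log10(GENOME_WIDE)]
```

The resulting points can be passed to any plotting library, with horizontal lines drawn at -log10 of the two thresholds.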

A QQ plot is a common way to demonstrate the lack of confounding
effects. In this plot, the ordered observed negative logarithms of the
p-values are plotted against the expected distribution. Ideally, the
points in the plot should align along the X = Y line, with deviation
at the end for the significant associations (Fig. 3b). One way to
quantify the lack of global inflation in the QQ plot is the genomic
inflation factor (λGC). This is calculated by determining the median
p-value of your GWAS test statistics, and calculating the quantile in
a chi-squared distribution with one degree of freedom that would give
this p-value. This is divided by the median of a chi-squared
distribution with one degree of freedom (0.4549), to give λGC.
Deviations of this value away from 1.0 indicate genome-wide
confounding in the data.

Fig. 3 Example (a) Manhattan and (b) QQ plots from a genome-wide association study. The Manhattan plot
shows a clear association signal on chromosome 9 that passes the genome-wide significance threshold (black
line). No other significant or suggestive (gray line) signals are present. The QQ-plot shows no deviation from
the expected line for low p-values, indicating no confounding due to population structure is present.
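
The genomic inflation factor described above can be computed with the Python standard library alone. This is a minimal sketch: the function names are illustrative, and it relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable, so its quantiles can be obtained from the normal distribution.

```python
from statistics import NormalDist, median


def chi2_1df_from_p(p):
    """Chi-squared (1 df) statistic whose upper-tail probability is p.

    Uses chi2(1 df) = Z**2, so the quantile is the squared normal
    quantile at 1 - p/2. Requires 0 < p <= 1.
    """
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


# Median of a chi-squared distribution with 1 df (about 0.4549).
CHI2_1DF_MEDIAN = chi2_1df_from_p(0.5)


def genomic_inflation(p_values):
    """Median observed test statistic divided by its expected median."""
    return chi2_1df_from_p(median(p_values)) / CHI2_1DF_MEDIAN


# Hypothetical p-values: evenly spread versus skewed toward small values.
lambda_null = genomic_inflation([0.1, 0.3, 0.5, 0.7, 0.9])
lambda_inflated = genomic_inflation([0.01, 0.1, 0.3, 0.4, 0.6])
```

With evenly spread p-values the factor is 1.0; when the median p-value falls below 0.5, the factor rises above 1.0.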

4.5 Replication of Significant Results

Despite the care taken in ruling out confounding factors in any study
analysis, the gold standard in GWAS is to replicate all significant
results in a second independent population [21]. When replicating,
only individual SNPs need be considered. Thus, the burden of multiple
testing is much lower and a more modest level of significance can be
used.
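
To illustrate why the replication threshold is more modest, the sketch below applies a Bonferroni correction over only the SNPs carried forward. Bonferroni is one common convention assumed here for illustration; the text does not prescribe a specific correction, and the SNP labels and p-values are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level for n_tests tests at family-wise level alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def replicated(p_values, alpha=0.05):
    """Flag each carried-forward SNP as replicated or not (assumed rule: p <= alpha / n)."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {snp: p <= threshold for snp, p in p_values.items()}


# Three hypothetical SNPs taken forward to the replication cohort.
replication_p = {"snp_a": 0.004, "snp_b": 0.03, "snp_c": 0.2}
calls = replicated(replication_p)

# For comparison: one million independent tests at alpha = 0.05
# gives the familiar genome-wide level of 5e-8.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
```

With three SNPs the per-test level is about 0.017, several orders of magnitude less stringent than the genome-wide level.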


5 Software 


The majority of the initial data cleaning and association analysis can
be performed using any statistical package. For example, there are
several libraries available in the R suite of software. However, it is
advantageous to use software specifically designed for genetic
association analysis. One of the most widely used pieces of software
in this field is PLINK [22], which has functions for performing both
data cleaning and association analyses. PLINK can perform
multidimensional scaling to investigate ancestral outliers;
alternatively, principal component analysis can be performed using
EIGENSOFT [23]. A range of software packages is available to perform
haplotyping and imputation, with several of the leading software
implementations being roughly equivalent. One potential combination
that provides robust, high-quality imputed genotypes is SHAPEIT for
haplotyping [24, 25] and IMPUTE2 for the imputation [19, 26].

Chapter 10


Adjusting for Familial Relatedness in the Analysis of GWAS Data

Russell Thomson and Rebekah McWhirter 

Abstract 

Relatedness within a sample can be of ancient (population stratification) or recent (familial structure) origin, 
and can either be known (pedigree data) or unknown (cryptic relatedness). All of these forms of familial 
relatedness have the potential to confound the results of genome-wide association studies. This chapter 
reviews the major methods available to researchers to adjust for the biases introduced by relatedness and 
maximize power to detect associations. The advantages and disadvantages of different methods are 
presented with reference to elements of study design, population characteristics, and computational 
requirements. 

Key words Genome-wide association studies, GWAS, Relatedness, Confounding, Population stratification,
Cryptic relatedness, Familial structure


1 Introduction 


Genome-wide association studies (GWAS) represent an effective means of
identifying genetic variants associated with disease risk, and
increasing sample sizes allow variants of diminishing effect size to
be identified. However, GWAS results can be confounded by population
stratification, in which ancestry differences result in systematic
differences in allele frequencies leading to spurious associations,
and by familial relatedness, in which the presence of related
individuals within the sample has the potential to violate the
assumptions of common analytical tools and to artificially inflate
test statistics, leading to false positives. Familial relatedness can
be further categorized into familial structure, in which relationships
are known and pedigrees can be constructed, and cryptic relatedness,
in which relationships are unknown.

It can be argued that population stratification and cryptic
relatedness are really two aspects of a single confounder: unknown
relationships between study participants [1]. The difference between
these two ideas is scale; while population structure refers to the
effect of ancient relatedness between groups of participants, cryptic
relatedness is unknown but relatively recent relatedness between
individual participants. However, it has been argued that population
structure and cryptic relatedness as such are not the source of
confounding, but rather these issues are proxies for the real source
of confounding, in which other causative loci confound the estimate of
the effect of a given locus, collectively known as “genetic
background” [2]. In order to avoid the reporting of false
associations, methods for addressing these potential sources of
confounding have had to be built into the study design and analytical
phases of GWAS.

Early studies addressed these issues by recruiting participants from
homogeneous populations, making use of pedigree information, and
removing related individuals from analyses. It became increasingly
apparent, however, that in order to identify variants of moderate or
small effect size, as well as rarer variants, much larger sample sizes
were necessary [3]. As sample sizes have increased, ensuring
homogeneity has become an increasingly unrealistic goal, rendering
these early methods based on study design less viable.

There are also many circumstances in which pedigree information is
unobtainable or unreliable, introducing the problem of cryptic
relatedness. This is also a problem for studies undertaken in
population isolates, which are particularly useful in order to
increase power to detect rare variants, as well as for undertaking
homozygosity analyses to identify recessive variants [4, 5]. While
population isolates tend to exhibit greater phenotypic, environmental,
and genetic homogeneity, as well as increased rare allele frequency
resulting from bottlenecks, they also highlight the need to mitigate
the confounding effects of relatedness [6]. Similarly, family-based
designs are robust to population stratification, but need to account
for relatedness.

For all these reasons, there have been substantial efforts directed at
developing statistical methods to correct for the biases introduced by
relatedness within a sample. Initially, simple corrective techniques
were employed for population stratification, such as genomic control
and principal components analysis (PCA). Genomic control has been
widely used both to identify and to correct inflation of test
statistics resulting from population stratification [7], whereas PCA
uses genotype data to identify the most important dimensions of
genetic variation, which can then be used to adjust the data to
account for the population structure they describe [8]. The problem
with these approaches is that they are aimed primarily at addressing
population structure and are less effective at addressing familial
relatedness [9].

While these methods remain useful in certain circumstances, they have
since been superseded by a range of more computationally sophisticated
methods that have the additional benefit of also addressing the
problem of cryptic relatedness and family structure. This chapter
examines the more widely used tools and compares the utility of these
methods in terms of their advantages and limitations for different
study designs.


2 Methods

2.1 Association Using Pedigree Data

In the past, genetic association was carried out using population
samples, and scientists with familial data carried out linkage
studies. Linkage methods involve searching for segments of chromosome
that segregate with a trait of interest in families. This approach has
proven useful for Mendelian diseases and rare variants in complex
diseases [10]. However, it has proven less effective for identifying
common variants in complex disease, leading to efforts to develop
approaches for combining association and linkage modeling methods in
pedigree data [11]. One such method is contained in the software LAMP
(Linkage and Association Modeling in Pedigrees) [12], which implements
joint linkage and association analysis using a maximum likelihood
approach. Through this, it is possible to test whether SNPs within a
linkage signal are in linkage disequilibrium with the putative disease
allele. This can narrow the chromosomal region of interest from an
often quite broad linkage peak.

The first family-based association method was developed for nuclear
families and dubbed the Transmission-Disequilibrium Test (TDT) [13].
The TDT can detect linkage only when genetic association is present.
While association can arise through confounding, linkage is
unaffected, and so the TDT is robust to population structure. There is
a suite of methods that extend the TDT to larger pedigrees, known
collectively as the Family-Based Association Tests [14]. While the
methods are designed for nuclear family data, they can be used on
large pedigree data. This is achieved by treating nuclear families
within pedigrees as independent, under the null hypothesis of no
association between the marker and trait of interest. These methods,
like the TDT, allow for population stratification between families.
Family-based association testing is implemented in the software FBAT.
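
The basic TDT for a biallelic marker can be sketched as follows. It is the standard McNemar-type statistic computed from heterozygous parents only; the function name and the counts are hypothetical, and this is not the FBAT implementation.

```python
import math


def tdt(transmitted, untransmitted):
    """Basic TDT statistic and p-value for one biallelic marker.

    transmitted:   heterozygous parents who passed allele A to the
                   affected child
    untransmitted: heterozygous parents who passed the other allele
    Under the null the statistic follows a chi-squared distribution
    with 1 df, whose upper tail is erfc(sqrt(x / 2)).
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    statistic = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
stat, p = tdt(60, 40)
```

Because only transmissions within families are compared, differences in allele frequency between families do not enter the statistic, which is why the test is robust to population structure.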

Variance component (VC) methods have a long history of use in the
analysis of pedigree data in human quantitative genetics, animal
genetics and animal breeding. They can be used to measure the genetic
influence on a continuously varying quantitative trait, and usually
assume that the trait is normally distributed. The idea behind these
methods is to divide the phenotypic variance into genetic and
environmental components, with the result that heritability, linkage
and association can all be determined using pedigree data. VC methods
rely on a linear mixed effects regression model, where the
non-independence among family members is accounted for by modeling the
variance structure of the relationships between individuals as a
random effect. As it is based on a linear regression model, covariates
(like age and sex), as well as relatedness, can be adjusted for.
However, VC methods do not account for population structure.
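
The variance component model described above is commonly written in the following form. The notation is illustrative and is not taken from any particular package: y is the vector of phenotypes, X holds the covariates (and, when testing association, the SNP genotype), Φ is the kinship matrix implied by the pedigree, and σ_g² and σ_e² are the additive genetic and residual variances.

```latex
\begin{align}
  y &= X\beta + g + e, \\
  g &\sim N\!\left(0,\; 2\Phi\,\sigma_g^2\right), \qquad
  e \sim N\!\left(0,\; I\,\sigma_e^2\right), \\
  \operatorname{Var}(y) &= 2\Phi\,\sigma_g^2 + I\,\sigma_e^2, \qquad
  h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{align}
```

Association is then tested through the fixed effect of the SNP in β, while the random effect g absorbs the correlation between relatives.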

SOLAR (Sequential Oligogenic Linkage Analysis Routines) is a program
designed to carry out linkage and association analyses, based on a
variance component method [15]. The SOLAR package is well maintained
and includes many extra features, such as estimating
identity-by-descent (IBD), allowing for a household effect, multiple
trait analyses, and genetic interactions, among many others. An
eigen-simplification of the calculation of the likelihood has been
incorporated into SOLAR, which has increased the speed of the program,
while still calculating exact p-values [16].

Another method that uses pedigree data in this way is based on the
Mqls statistic [17]. It is an extension of the Wqls and CCqls
statistics; in fact, Mqls stands for “modified” or “more powerful”
quasi-likelihood score, giving its name to the program, MQLS. These
methods calculate a statistic based on the difference in the allele
frequencies of the cases and controls, using a quasi-likelihood method
with a χ² distribution on 1 degree of freedom under the null
hypothesis. The Mqls statistic provides greater weight to individuals
with closely related disease-carrying relatives, thus producing an
even more powerful test. This method has been shown to be more
powerful than other VC methods for a dichotomous trait (such as case
status); however, the disadvantage is that it is not possible to
adjust for covariates. GLOGS (Genome-wide LOGistic mixed model/Score
test) also uses a quasi-likelihood method for dichotomous traits,
while adjusting for relatedness [18]. This approach is based on a
logistic regression model, and it can be used to adjust for up to
three covariates while testing for association.

MASTOR uses a variance components method to adjust for relatedness
from the known pedigrees [19]. This method is akin to the MQLS method,
but for detecting association with a continuous trait. Unlike MQLS, it
is possible to adjust for covariates. The advantage MASTOR has over an
earlier method, known as GTAM, is that, similar to the MQLS method, it
can account for missing phenotype, genotype or covariate data, even
when missingness is related to the trait of interest, and that it
exploits the relatedness within the sample to model heritability and
increase power. Approaches like these are useful for studies
containing participants with known relatedness as well as unrelated
participants.

2.2 Association with Adjustment for Cryptic Relatedness and Population Structure


There are numerous methods that use variance components to adjust for
relatedness by using genetic data to infer relatedness between
individuals. These methods can also adjust for population
stratification and are sometimes called genetic control methods. All
methods described in this section are based on variance components
methods, using a linear mixed effect model and assuming a continuous
trait. They can also be applied to a dichotomous trait (such as
disease status), under the assumption of a threshold-liability model
[20]. This assumes that, while the observed disease status is
dichotomous, the unobserved disease liability follows a normal
distribution and that there is a linear transformation between the
two. However, when analyzing a dichotomous trait with a variance
component method, it is not possible to calculate the standard effect
size of an odds ratio, which is often used for pooling comparable data
in genome-wide association meta-analyses [21].
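
How relatedness can be inferred from genotype data alone may be sketched as follows. The example computes a genetic relationship matrix as the average standardized genotype cross-product over SNPs, which is one commonly used form of estimator; it is an assumption of this sketch rather than the exact formula of any package named here (packages may, for example, treat the diagonal differently). The genotypes are hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """Estimate pairwise relatedness from allele counts.

    genotypes[i][j] is the count (0, 1 or 2) of the reference allele
    for individual i at SNP j. Each entry of the result is the mean,
    over polymorphic SNPs, of
        (x_aj - 2p_j) * (x_bj - 2p_j) / (2 * p_j * (1 - p_j)),
    where p_j is the sample allele frequency at SNP j.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n_ind) for j in range(n_snp)]
    # Monomorphic SNPs carry no information and would divide by zero.
    used = [j for j in range(n_snp) if 0.0 < freqs[j] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs")

    grm = [[0.0] * n_ind for _ in range(n_ind)]
    for a in range(n_ind):
        for b in range(a, n_ind):
            total = 0.0
            for j in used:
                p = freqs[j]
                total += ((genotypes[a][j] - 2 * p) * (genotypes[b][j] - 2 * p)
                          / (2 * p * (1 - p)))
            grm[a][b] = grm[b][a] = total / len(used)
    return grm


# Hypothetical genotypes: individuals 0 and 1 are identical, as are 2 and 3.
toy_genotypes = [[0, 2, 1], [0, 2, 1], [2, 0, 1], [2, 0, 1]]
grm = genetic_relationship_matrix(toy_genotypes)
```

In a mixed model, a matrix of this kind takes the place of the pedigree-derived kinship matrix in the covariance of the random effect.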

Variance component methods can be loosely divided into approximate and
exact methods. Approximate methods are useful because of their
improved computer speed and memory usage compared to their exact
counterparts, which become computationally impracticable in larger
cohorts. While the approximate methods maintain the same type I error
rates as exact methods, for some pedigree structures approximate
methods suffer a loss of power [22].

Nevertheless, there have been some advances in the speed of exact
algorithms. For example, the software GEMMA and FaST-LMM are able to
obtain the same p-values as the EMMA method, but within a fraction of
the time [22-24]. The GEMMA method fits a Bayesian sparse linear mixed
model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the
proportion of variance in phenotypes explained. Similarly, the
BOLT-LMM method uses a Bayesian mixed model to more accurately model
genetic architecture, thereby increasing the power to detect
associations in larger cohorts with reduced time and memory demands
[25].

The simplest method to implement is the GRAMMAR-Gamma algorithm [26].
It fits a linear mixed model using the kinship matrix and the
phenotype of interest for the null model of no association between
phenotype and genotype. It then uses the residuals of the null model
to search for an association with genotype using a standard linear
model. This method is the fastest method available and it is easy to
implement within the R statistical framework, using the library
GenABEL [27]. However, it has been shown to have a substantial loss of
power for more complicated pedigree structures [22].

There are a number of approximate methods available that avoid
re-estimating the variance components for each genotype by rewriting
the model in terms of a single parameter (the ratio of genetic
variance to residual variance). This keeps the heritability estimated
from the null model fixed when testing each genotype, resulting in a
substantial reduction in memory and computer run time. Methods such as
P3D (Population Parameters Previously Determined), GCTA, FaST-LMM and
EMMAX can implement this [28]. GCTA can also be used to estimate the
variance explained by all SNPs in a genome-wide study, providing the
scope to predict the trait of an individual in a replication set,
based on genome-wide data [29]. The software that implements the P3D
method (TASSEL) also includes a method to further reduce computer time
(called compression), by clustering the individuals into fewer groups
based on the kinship among the individuals [30].

While FaST-LMM can calculate either exact or approximate p-values, it
can also capitalize on a few other approximation tricks to speed the
process up. One is to use a realized relationship matrix instead of
IBD, so that just one spectral decomposition is needed to test all
SNPs [31]. Another is to choose a subset of 4000 or 8000 equally
spaced markers to estimate cryptic relatedness. A recent simulation
study, however, suggests that while using a subset of markers will
adjust for cryptic relatedness, population stratification correction
may be compromised as a result [32]. FaST-LMM-Select addresses this
issue by extending FaST-LMM to include principal components of the
genotype matrix as fixed effects [33].

Finally, there can be a loss of power when using the locus of interest
to estimate cryptic relatedness. FaST-LMM and GCTA-LOCO can overcome
this by implementing a method to leave out the chromosome containing
the candidate locus when estimating genetic relatedness.

2.3 Known and Unknown Relatedness

Methods that can incorporate both pedigree information and cryptic
relatedness are the most successful in accounting for confounding due
to population structure and relatedness. Two such packages are
ROADTRIPS and Mendel. The ROADTRIPS program uses quasi-likelihood
methods (implemented in the Mqls and similar statistics) to
incorporate both known and unknown relatedness [34]. The inclusion of
pedigree data, where it is known, increases the power of this method
to detect association in case-control studies, and genomic data are
used to estimate a covariance matrix for unknown relationships, both
recent and ancient.

Mendel is a software package that was first designed to implement a
number of linkage algorithms. It has evolved over time to include the
capability to run linkage, association, gene dropping and many other
genetic tools. Recently, it has included a “Pedigree GWAS” option that
implements a rapid variance components method that can adjust for both
cryptic and known relationships in a genome-wide association study
[35].

2.4 Rare Variant and Sequence Analysis

Genome-wide association studies are usually undertaken using SNP array
data, with newer chips covering millions of variants, including some
rarer SNPs identified in the 1000 Genomes Project [36]. Furthermore,
as rare variants are likely to contribute substantially to
heritability of complex diseases and the cost of whole exome and whole
genome sequencing continues to drop, the issue of maintaining adequate
power while adjusting for familial relatedness in association testing
of sequencing data becomes increasingly pressing. Early work suggests
that similar methods will be effective, such as combining LMM with a
kernel score test to aggregate the effects of rare variants within a
gene [11, 37]. That is, when investigating the role of rare variants
on the trait of interest, the assumption is that multiple variants
within a region have an effect, and it is therefore useful to be able
to group the rare variants together, rather than assess them
individually, as in standard GWAS methods.
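
The idea of grouping rare variants can be illustrated with a simple burden score, which counts the rare alleles each individual carries within a region. This is a deliberately simplified stand-in and not the kernel score test cited above; the function name, the 1 % frequency cut-off, and the data are assumptions made for this sketch.

```python
def rare_allele_burden(genotypes, allele_freqs, max_freq=0.01):
    """Per-individual count of rare alleles across one gene or region.

    genotypes[i][j] is the minor allele count (0, 1 or 2) for
    individual i at variant j; allele_freqs[j] is the minor allele
    frequency of variant j. Variants above max_freq (an assumed
    cut-off) are ignored.
    """
    rare = [j for j, freq in enumerate(allele_freqs) if 0.0 < freq <= max_freq]
    return [sum(row[j] for j in rare) for row in genotypes]


# Hypothetical region with two rare variants and one common variant.
region_genotypes = [[0, 1, 2], [1, 0, 0], [0, 0, 1]]
region_freqs = [0.005, 0.008, 0.3]
burden = rare_allele_burden(region_genotypes, region_freqs)
```

The resulting score can be tested as a single predictor in place of the individual variants, for example as a fixed effect in a model that also adjusts for relatedness.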

Several other methods have been proposed to incorporate information
across genes or chromosome regions to test for association with the
trait of interest. As it is beyond the scope of this chapter to go
into the detail of each of these methods, we only list them here. Some
examples of methods that incorporate relatedness adjustment include
RHM (regional heritability mapping) with software REACTA [38] and
VEGAS [39]. The sequence analysis tool VAAST has been extended to
allow the use of pedigree data (pVAAST) [40]. The extension of the
quasi-likelihood methods to rare variants is implemented in MONSTER
[41]. GenABEL, the R library that implements the GRAMMAR-Gamma method,
also has a function for rare variant analyses, called cocohet [42].
The adjusted SKAT method, ASKAT, is useful for investigating rare
variants in family-based study designs as it combines the SKAT and
FaST-LMM methods to control for cryptic and family relatedness [43].
Finally, the software FBAT contains a test, FBAT-v, that is
specifically designed for rare variants [44]. All of the above methods
are summarized in Tables 1 and 2.


3 Example 

To illustrate, we will work through an example of a genome-wide
association scan using a simulated dataset, where the study
participants are related (see Note 1).

3.1 Dataset

The data consist of 165 individuals from the CEPH (Utah residents with
ancestry from northern and western Europe) HapMap data set [45],
including 50 trios, three parent-child relationships and nine
unrelated individuals. The genotype data contain 65,173 SNPs across
the genome.

This data set is designed to emulate a small GWAS with the intention
of finding evidence for genetic regions that are associated with
schizophrenia. Rather than a case-control study, this is a
population-based study, where the subclinical quantitative trait of
the Ekman 60-Faces emotion recognition test [46] was used as an
endophenotype for schizophrenia. All individuals in the data set have
both genotype and phenotype information (see Note 2).

The phenotype is correlated with the covariates: age, sex and
education. The variance explained by the SNPs in this study was 50 %.
There are six causal variants, with equal effect size, distributed
across the genome. They are SNPs: rs2393646, rs10793370, rs6769400,
rs6774660, rs210400, rs1481440, and rs9465317 (see Note 3).

Table 1
A list of available methods, with the authors and certain capabilities given

Method | Authors | Trait types | Adjust for covariates? | Cryptic or known pedigree structure
------ | ------- | ----------- | ---------------------- | -----------------------------------
FBAT | NM Laird | Continuous/Categorical/Time-to-event | No | Known
LAMP | M Li and G Abecasis | Categorical | No | Known
MQLS | T Thornton and M-S McPeek | Categorical | No | Known
GLOGS | S Stanhope and M Abney | Categorical | Yes | Known
MASTOR | J Jakobsdottir and M-S McPeek | Continuous | Yes | Known
SOLAR | J Blangero, K Lange, T Dyer, L Almasy, H Goring, J Williams, M Boehnke, C Peterson | Continuous/Categorical (a) | Yes | Known
P3D | E Buckler, T Casstevens, P Bradbury | Continuous/Categorical (a) | Yes | Cryptic
EMMAX | H Kang | Continuous/Categorical (a) | Yes | Cryptic
FaST-LMM | C Lippert, J Listgarten, and D Heckerman | Continuous/Categorical (a) | Yes | Cryptic
GEMMA | X Zhou and M Stephens | Continuous/Categorical (a) | Yes | Cryptic
BOLT-LMM | P-R Loh | Continuous/Categorical (a) | Yes | Cryptic
GRAMMAR-Gamma | YS Aulchenko | Continuous/Categorical (a) | Yes | Cryptic
GCTA | J Yang and P Visscher | Continuous/Categorical (a) | Yes | Cryptic
ROADTRIPS | T Thornton | Categorical | No | Both
Mendel | H Zhou, J Blangero, T Dyer, K Chan, E Sobel, K Lange | Continuous/Categorical (a) | Yes | Both

(a) These variance component methods are designed for a continuous trait; however, it is possible to use them to analyze a
categorical trait if proper attention is given to rare variants, rare diseases and small sample sizes

Table 2
A list of available methods with software format and links to the software page

Method | Software format | Link
------ | --------------- | ----
FBAT | Command Line and menu driven. Source Code and Win/Mac/Linux executables | http://www.hsph.harvard.edu/fbat
LAMP | Command Line. Source Code and Win/Mac/Linux executables | http://www.sph.umich.edu/csg/abecasis/LAMP
MQLS | Command Line. Source Code and Win/Mac/Linux executables | http://www.stat.uchicago.edu/~mcpeek/software/MQLS/index.html or http://www.sph.umich.edu/csg/liang/MQLS
GLOGS | Command Line. C Source Code | http://www.bioinformatics.org/~stanhope/GLOGS
MASTOR | Command Line. C Source Code | http://www.stat.uchicago.edu/~mcpeek/software/MASTOR
SOLAR | Command Line. Win/Mac/Linux Executable | http://solar.txbiomedgenetics.org/
P3D | TASSEL: JAVA Command Line, Win/Mac/Linux Executable. R Library GAPIT. SAS Code | http://www.maizegenetics.net/statistical-genetics
EMMAX | Command Line. Source Code and Win/Linux executables | http://genetics.cs.ucla.edu/emmax
FaST-LMM | Command Line. Win/Linux executables | http://mscompbio.codeplex.com/
GEMMA | Command Line. Linux executable | http://stephenslab.uchicago.edu/software.html#gemma
BOLT-LMM | Command Line. Linux executable | http://www.hsph.harvard.edu/alkes-price/software/
GRAMMAR-Gamma | R library GenABEL | http://www.genabel.org
GCTA | Command Line. Linux executable | http://www.complextraitgenomics.com/software/gcta
ROADTRIPS | Command Line. Source code | http://www.stat.uchicago.edu/~mcpeek/software/ROADTRIPS
Mendel | Command Line. Source code | http://genetics.ucla.edu/software/mendel


3.2 Method 


Here we describe, step by step, the implementation of the GCTA
software to carry out the association while adjusting for relatedness.

Fig. 1 A Manhattan plot of the results from the MLMA option in GCTA

1. Download the data (see Note 1) and the GCTA software
(see Table 2) and run the commands:

    gcta64 --bfile CEU --make-grm --autosome --out CEU
    gcta64 --mlma-loco --mlma-no-adj-covar --bfile CEU \
        --grm-bin CEU --out EK --pheno EK.txt \
        --qcovar age_edu.txt --covar gender.txt

2. Load results in R [47] or Haploview [48] and create a Manhattan
plot (see Note 4).

3. Create a QQ-plot and calculate the genomic inflation factor (λ)
(see Note 5). Convention suggests that with λ < 1.05, your p-values
are not overly inflated due to relatedness, population stratification
or some other reason.
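
Steps 2 and 3 can also be sketched in Python. The example below assumes that the GCTA results file is tab-delimited with the header columns Chr, SNP, bp, A1, A2, Freq, b, se and p; this layout, the function names, and the records shown are assumptions for illustration and should be checked against the output of the GCTA version actually used.

```python
import csv
import io
import math


def read_mlma(handle):
    """Read (chromosome, SNP, position, p-value) from an MLMA-style table.

    Assumes a tab-delimited file with columns named Chr, SNP, bp and p.
    Rows whose p-value is missing or not in (0, 1] are skipped.
    """
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        try:
            p = float(record["p"])
        except (TypeError, ValueError):
            continue
        if not 0.0 < p <= 1.0:   # also rejects NaN
            continue
        rows.append((record["Chr"], record["SNP"], int(record["bp"]), p))
    return rows


def qq_points(p_values):
    """Pairs of (expected, observed) -log10(p), both in ascending order."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


# Hypothetical three-line results table standing in for the real output file.
sample = (
    "Chr\tSNP\tbp\tA1\tA2\tFreq\tb\tse\tp\n"
    "1\tsnp_a\t100\tA\tG\t0.3\t0.1\t0.05\t0.04\n"
    "1\tsnp_b\t200\tC\tT\t0.2\t0.0\t0.05\tnan\n"
    "2\tsnp_c\t50\tA\tC\t0.4\t0.5\t0.05\t1e-6\n"
)
rows = read_mlma(io.StringIO(sample))
qq = qq_points([p for _, _, _, p in rows])
top_hits = sorted(rows, key=lambda row: row[3])[:10]
```

To use a real results file, pass `open(path, newline="")` to `read_mlma` in place of the in-memory sample; the QQ pairs can then be plotted against the X = Y line.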

3.3 Results

As can be seen in Fig. 1, there are no SNPs with a p-value less than
the conventional threshold for genome-wide significance of 5 × 10^-8.
This study consists of multiple causal variants and a small sample
size. To reach genome-wide significance would require one causal
variant with a large effect size and/or a very large sample size. If
you examine the p-values for the causal SNPs in this simulation you
will find that, while they have very small p-values, they do not
appear in a list of the top ten SNPs with the best evidence for
association. This indicates that, in any study with similar
attributes, the top ten hits are likely to be false positives.

Figure 2 shows the QQ-plot for the p-values produced when adjusting
for relatedness using the GCTA software, while Fig. 3 shows the
QQ-plot when relatedness was not adjusted for (see Note 6). Note that
the points do not lie on the straight line in Fig. 3 but they do in
Fig. 2. This indicates that, unless relatedness is taken into account,
the p-values are spuriously inflated. When using the
leave-one-chromosome-out command in GCTA, the p-values show a
similarly high level of genomic inflation as in Fig. 3. This suggests
that, with as few as 65,000 SNPs, it is important to use all available
SNPs to estimate the genetic relationship matrix.

Fig. 2 The QQ-plot for the GCTA MLMA results (see Note 5)

Fig. 3 The QQ-plot for association results, not adjusted for relatedness (see Note 5)

3.4 Discussion

There is thus a large and continually expanding array of tools
available for undertaking genome-wide association studies in the
presence of relatedness, whether known or unknown, ancient or recent.
It is apparent that much research has focused on development of linear
mixed model methods, with an emphasis on refining methods to reduce
computational costs. Precisely which tool will be most appropriate for
a given study will depend on the characteristics of the sample and the
population from which it was drawn, as well as considerations relating
to computation time and memory usage. The tables above provide a
starting point for identifying the salient features of each method,
and for narrowing down the choices.

All of the methods described here, except two (LAMP and 
MQLS), can either adjust for unknown population stratification or 
include principal components as a covariate. Care should be taken 
when using principal components as a covariate as standard PCA 
methods can potentially be confounded by the presence of related 
individuals in a sample and so PCA methods accounting for this 
should be used in preference [50]. 

Overall, the GRAMMAR-Gamma method implemented in GenABEL is the
fastest method, and one of the easiest to implement. However, because
it uses an approximate, rather than exact, method, it is not well
suited to samples with complicated pedigrees or studies that are
potentially under-powered [22]. In such instances, an exact method
would be preferable, despite the increase in computational demands.

For case-control studies, the methods MQLS and ROADTRIPS are best for
incorporating participants with incomplete data. This is a relatively
common scenario, where a given study may include participants with
missing genotype, phenotype or covariate data. In a familial prostate
cancer study, for example, the authors found that by using MQLS, and
thus incorporating individuals with unknown case status (women and
young men) and unknown genotype (ungenotyped affected brothers), an
increase in power was obtained that was of similar size to the loss of
power that was a result of the cases being related [51]. It is also
worth noting that adjusting for relatedness using a mixed model
approach can be of value even in an ideal, unrelated population,
because it will address the issue of confounding introduced by
“genetic background”; that is, other unidentified causal loci
elsewhere in the genome [2].

One final consideration is that case-control GWA studies often 
oversample disease cases to increase study power. This leads to 
ascertainment bias that can result in a loss of power when adjusting 
for covariates [52]. Yang et al. [32] show, through simulation, that 
in studies with large ascertainment bias (that is, when the disease 
prevalence is small and the sample size is large), the linear mixed 
effects methods described in this review will suffer a substantial loss 
of power. For studies with unrelated individuals, it is better to 
adjust for population structure using standard PCA methods. For 
studies with ascertainment bias and related individuals that require 
adjusting for covariates, it is possible that GLOGS, the method based on
logistic regression, will not suffer the same loss of power.
However, further simulation studies are required to confirm this. 

To minimize false GWAS positives, it is important to adjust for 
confounding due to familial relatedness within samples. This review 
has identified a range of tools that can be used to achieve this goal in 
different scenarios. This is an area currently receiving a great deal of
attention, and new packages are continually being developed. As
whole genome (and whole exome) sequencing becomes more 
common, and sample sizes continue to expand, we can expect to 
see more tools produced that address the needs of these studies and 
address some of the limitations of the current methods. 


4 Notes 


1. This can be accessed at https://cloudstor.aarnet.edu.au/plus/index.php/s/fznOGhsDovAlA6z.

2. The nature of the data will influence the choice of analysis
software. In the current scenario, genotype information is available
for all individuals and GCTA is therefore appropriate. When
working with a dataset that includes ungenotyped individuals,
using ROADTRIPS for case-control studies or MASTOR for quantitative
trait studies would allow these individuals to be included
and potentially increase the power of the study.

3. The simulation of the phenotype was based on (a)—linear 
regression in R for the correlation with the covariates and 
(b)—GCTA for the association with the causal variants. The 
GCTA simulation used the option --simu-qt, which assumes a 
simple additive genetic model. 

4. To generate the Manhattan plot in Haploview, the output file
from GCTA (filename: EK.mlma) can be entered into Haploview
using the PLINK format option. The R code to generate
the Manhattan plot is as follows:

# Read the GCTA mixed-model association results
gcta <- read.table("EK.mlma",header=T)
# Length of each chromosome (largest base-pair position observed)
lengthchr <- tapply(gcta$bp,gcta$Chr,max)
# Cumulative offset of each chromosome along the x-axis
pos <- c(0,sapply(1:22,function(i)
  sum(as.numeric(lengthchr[1:i]))))
png("manhattan_plot_gcta.png",width=800,height=500)
par(mar=c(4,5,1,1))
plot(pos[gcta$Chr]+gcta$bp,-1*log10(gcta$p),
  cex=0.6,xaxt="n",ylab="-log10(p-value)",
  xlab="Chromosome",cex.lab=2,cex.axis=2,
  lwd=2,col=grey(0.4))
abline(v=pos,lty=2)
# Place each chromosome label at the midpoint of its interval
midpoint <- sapply(1:22,function(i) (pos[i]+pos[i+1])/2)
axis(1,at=pos,labels=F)
axis(1,at=midpoint,tick=FALSE,labels=1:22,cex.axis=2)
dev.off()

5. The R code used to generate the QQ-plot is as follows:

gcta <- read.table("EK.mlma",header=T)
pv = gcta$p
m=length(pv)
# Expected -log10 p-values under the null (uniform quantiles)
expect.stats=-log10(seq(1/(m+1),m/(m+1),length.out=m))
lambda=median(-log10(pv))/median(expect.stats)
png("QQplot_gcta.png")
par(mar=c(5,5,1,1))
qqplot(expect.stats,-log10(pv),xlab="expected -log10(pvalues)",
  ylab="observed -log10(pvalues)",cex.axis=2,cex.lab=2,lwd=2)
abline(a=0,b=1)
text(0,5,bquote(lambda==.(round(lambda,3))),adj=0,cex=2)
dev.off()

6. This association analysis was conducted without adjusting for
relatedness using PLINK [49], via the command:

plink --noweb --bfile CEU --pheno EK.txt --linear --covar age_edu.txt --sex --out CEU_assoc_plink


References 

1. Astle W, Balding DJ (2009) Population structure
and cryptic relatedness in genetic association
studies. Stat Sci 24:451-471

2. Vilhjalmsson BJ, Nordborg M (2013) The
nature of confounding in genome-wide association
studies. Nat Rev Genet 14:1-2

3. Manolio TA, Collins FS, Cox NJ, Goldstein
DB, Hindorff LA, Hunter DJ, McCarthy MI,
Ramos EM, Cardon LR, Chakravarti A et al
(2009) Finding the missing heritability of complex
diseases. Nature 461:747-753

4. Jakkula E, Leppa V, Sulonen A-M, Varilo T,
Kallio S, Kemppinen A, Purcell S, Koivisto K,
Tienari P, Sumelahti M-L et al (2010)
Genome-wide association study in a high-risk
isolate for multiple sclerosis reveals associated
variants in STAT3 gene. Am J Hum Genet
86:285-291

5. McQuillan R, Leutenegger A-L, Abdel-Rahman R,
Franklin CS, Pericic M, Barac-Lauc L,
Smolej-Narancic N, Janicijevic B, Polasek O,
Tenesa A et al (2008) Runs of homozygosity
in European populations. Am J Hum
Genet 83:359-372


6. Zeggini E (2012) Next-generation association
studies for complex traits. Nat Genet
43:287-288

7. Devlin B, Roeder K (1999) Genomic control
for association studies. Biometrics
55:997-1004

8. Price AL, Patterson NJ, Plenge RM, Weinblatt
ME, Shadick NA, Reich D (2006) Principal
components analysis corrects for stratification
in genome-wide association studies. Nat Genet
38:904-909

9. Price AL, Zaitlen NA, Reich D, Patterson N
(2010) New approaches to population stratification
in genome-wide association studies. Nat
Rev Genet 11:459-463

10. Wooster R, Neuhausen SL, Mangion J, Quirk
Y, Ford D, Collins N, Nguyen K, Seal S, Tran
T, Averill D et al (1994) Localization of a
breast cancer susceptibility gene, BRCA2, to
chromosome 13q12-13. Science 265:2088-2090

11. Li Y, Foo JN, Liany H, Low H-Q, Liu J (2014)
Combined linkage and family-based association
analysis improved candidate gene
detection in Genetic Analysis Workshop 18
simulation data. BMC Proc 8:S29

12. Li M, Boehnke M, Abecasis GR (2005) Joint
modeling of linkage and association: identifying
SNPs responsible for a linkage signal. Am J
Hum Genet 76:934-949

13. Spielman RS, Ewens WJ (1998) A sibship test
for linkage in the presence of association: the
sib transmission/disequilibrium test. Am J
Hum Genet 62:450-458

14. Zhou JJ, Yip W-K, Cho MH, Qiao D, McDonald
M-LN, Laird NM (2014) A comparative
analysis of family-based and population-based
association tests using whole genome sequence
data. BMC Proc 8:S33

15. Almasy L, Blangero J (1998) Multipoint
quantitative-trait linkage analysis in general
pedigrees. Am J Hum Genet 62:1198-1211

16. Blangero J, Diego VP, Dyer TD, Almeida M,
Peralta J, Kent JWJ, Williams JT, Almasy L,
Goring HH (2013) A kernel of truth: statistical
advances in polygenic variance component
models for complex human pedigrees. Adv
Genet 81:1-31

17. Thornton T, McPeek MS (2007) Case-control
association testing with related individuals: a
more powerful quasi-likelihood score test. Am
J Hum Genet 81:321-337

18. Stanhope SA, Abney M (2012) GLOGS: a fast
and powerful method for GWAS of binary
traits with risk covariates in related populations.
Bioinformatics 28:1553-1554

19. Jakobsdottir J, McPeek MS (2013) MASTOR:
mixed-model association mapping of quantitative
traits in samples with related individuals.
Am J Hum Genet 92:652-666

20. Falconer DS (1965) The inheritance of liability
to certain diseases, estimated from the incidence
among relatives. Ann Hum Genet
29:51-76

21. Chen MH, Liu X, Larson MG, Fox CS, Vasan
RS, Yang Q (2011) A comparison of strategies
for analyzing dichotomous outcomes in
genome-wide association studies with general
pedigrees. Genet Epidemiol 35:650-657

22. Zhou X, Stephens M (2012) Genome-wide
efficient mixed model analysis for association
studies. Nat Genet 44:821-824

23. Eu-ahsunthornwattana J, Howey RAJ, Cordell
HJ (2014) Accounting for relatedness in
family-based association studies: application to
Genetic Analysis Workshop 18 data. BMC Proc
8:S79

24. Listgarten J, Lippert C, Kadie CM, Davidson
RI, Eskin E, Heckerman D (2012) Improved
linear mixed models for genome-wide association
studies. Nat Methods 9:525-526


25. Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjalmsson
BJ, Finucane HK, Salem RM, Chasman
DI, Ridker PM, Neale BM, Berger B,
Patterson N, Price AL (2015) Efficient Bayesian
mixed model analysis increases association
power in large cohorts. Nat Genet 47:284-290

26. Svishcheva GR, Axenovich TI, Belonogova
NM, van Duijn CM, Aulchenko YS (2012)
Rapid variance components-based method for
whole-genome association analysis. Nat Genet
44:1166-1170

27. Aulchenko YS, Ripke S, Isaacs A, van Duijn
CM (2007) GenABEL: an R library for
genome-wide association analysis. Bioinformatics
23:1294-1296

28. Kang HM, Sul JH, Service SK, Zaitlen NA,
Kong SY, Freimer NB, Sabatti C, Eskin E
(2010) Variance component model to account
for sample structure in genome-wide association
studies. Nat Genet 42:348-354

29. Yang J, Lee SH, Goddard ME, Visscher PM
(2011) GCTA: a tool for genome-wide complex
trait analysis. Am J Hum Genet 88:76-82

30. Bradbury PJ, Zhang Z, Kroon DE, Casstevens
TM, Ramdoss Y, Buckler ES (2007) TASSEL:
software for association mapping of complex
traits in diverse samples. Bioinformatics
23:2633-2635

31. Lynch M, Ritland K (1999) Estimation of pairwise
relatedness with molecular markers.
Genetics 152:1753-1766

32. Yang J, Zaitlen NA, Goddard ME, Visscher
PM, Price AL (2014) Advantages and pitfalls
in the application of mixed-model association
methods. Nat Genet 46:100-106

33. Tucker G, Price AL, Berger B (2014) Improving
the power of GWAS and avoiding confounding
from population stratification with
PC-Select. Genetics 197:1045-1049.
doi:10.1534/genetics.114.164285

34. Thornton T, McPeek MS (2010) ROADTRIPS:
case-control association testing with
partially or completely unknown population
and pedigree structure. Am J Hum Genet
86:172-184

35. Lange K, Papp JC, Sinsheimer JS, Sripracha R,
Zhou H, Sobel EM (2013) Mendel: the Swiss
army knife of genetic analysis programs. Bioinformatics
29:1568-1570

36. The 1000 Genomes Project Consortium
(2012) An integrated map of genetic variation
from 1,092 human genomes. Nature
491:56-65

37. Svishcheva GR, Belonogova NM, Axenovich
TI (2014) FFBSKAT: fast family-based
sequence kernel association test. PLoS One
9:e99407




38. Uemoto Y, Pong-Wong R, Navarro P, Vitart V,
Hayward C, Wilson JF, Rudan I, Campbell H,
Hastie ND, Wright AF et al (2013) The power
of regional heritability analysis for rare and
common variant detection: simulations and
application to eye biometrical traits. Front
Genet 4, Article 232

39. Liu JZ, Mcrae AF, Nyholt DR, Medland SE,
Wray NR, Brown KM, Investigators AMFS,
Hayward NK, Montgomery GW, Visscher PM
et al (2010) A versatile gene-based test for
genome-wide association studies. Am J Hum
Genet 87:139-145

40. Hu H, Roach JC, Coon H, Guthery SL, Voelkerding
KV, Margraf RL, Durtschi JD, Tavtigian
SV, Shankaracharya, Wu W et al (2014) A unified
test of linkage analysis and rare-variant
association for analysis of pedigree sequence
data. Nat Biotechnol 32:663-669

41. Jiang D, McPeek MS (2014) Robust rare variant
association testing for quantitative traits in
samples with related individuals. Genet Epidemiol
38:10-20

42. Liu F, Struchalin MV, van Duijn K, Hofman A,
Uitterlinden AG, Aulchenko YS, Kayser M
(2011) Detecting low frequent loss-of-function
alleles in genome wide association studies
with red hair color as an example. PLoS One
6:e28145

43. Oualkacha K, Dastani Z, Li R, Cingolani PE,
Spector TD, Hammond CJ, Richards JB,
Ciampi A, Greenwood CMT (2013) Adjusted
sequence kernel association test for rare variants
controlling for cryptic and family relatedness.
Genet Epidemiol 37:366-376


44. De G, Yip W-K, Ionita-Laza I, Laird N (2013)
Rare variant analysis for family-based design.
PLoS One 8:e48495

45. Thorisson GA, Smith AV, Krishnan L, Stein LD
(2005) The international HapMap project web
site. Genome Res 15:1592-1593

46. Ekman P, Friesen WV (1976) Pictures of facial
affect. Consulting Psychologists Press, Palo
Alto, CA

47. R Core Team (2014) R Foundation for Statistical
Computing, Vienna, Austria

48. Barrett JC, Fry B, Maller J, Daly MJ (2005)
Haploview: analysis and visualization of LD and
haplotype maps. Bioinformatics 21:263-265

49. Purcell S, Neale B, Todd-Brown K, Thomas L,
Ferreira MAR, Bender D, Maller J, Sklar P, de
Bakker PIW, Daly MJ et al (2007) PLINK: a
tool set for whole-genome association and
population-based linkage analyses. Am J Hum
Genet 81:559-575

50. Thornton TAA, Austin MA (2013) Software
and data resources for genetic association studies:
Mini Review. CAB Rev 8:1-6

51. Fitzgerald LM, Patterson B, Thomson R, Polanowski
A, Quinn S, Brohede J, Thornton T,
Challis D, Mackey DA, Dwyer T et al (2009)
Identification of a prostate cancer susceptibility
gene on chromosome 5p13q12 associated with
risk of both familial and sporadic disease. Eur J
Hum Genet 17:368-377

52. Pirinen M, Donnelly P, Spencer CC (2012)
Including known covariates can reduce power
to detect genetic effects in case-control studies.
Nat Genet 44:848-851



Chapter 11 


Analysis of Quantitative Trait Loci

David L. Duffy 

Abstract 

Although the term quantitative trait locus (QTL) strictly refers merely to a genetic variant that causes 
changes in a quantitative phenotype such as height, QTL analysis more usually describes techniques used to 
study oligogenic or polygenic traits where each identified locus contributes a relatively small amount to the 
genetic determination of the trait, which may be categorical in nature. Originally, too, it would be clear that 
it covered segregation and genetic linkage analysis, but now genetic association analysis in a genome-wide 
SNP or sequencing experiment would be the commonest application. The same biometrical genetic 
statistical apparatus used in this setting—analysis of variance, linear or generalized linear mixed models— 
can actually be applied to categorical phenotypes, as well as to multiple traits simultaneously, dealing with 
and taking advantage of genetic pleiotropy. Most recently, they are being used to make inferences about 
population and evolutionary genetics, with applications ranging from human disease to control of disease-causing
organisms. Several computer software packages make it relatively straightforward to fit these
statistically complex models to the large amounts of genotype and phenotype data routinely collected today. 

Key words Biometrical genetics, Mixed model, Kinship, Linkage analysis, Association analysis,
Linkage disequilibrium, Population genetics


1 Introduction 


The basic QTL model is a simple linear regression equation with
one trait and one measured locus [1]:

$$y_i = g_i + e_i$$

where $y_i$ is the trait value for the $i$th individual; $g_i$ is the average trait
value in that population for the particular genotype the individual
carries (genotypic mean); and $e_i$ is the perturbation from the genotypic
mean in that individual (residual), due to the buffeting of
environmental and developmental factors, which we will model as
being a random value drawn from some statistical distribution such
as the Gaussian. The simplest extension of this model includes
terms for those environmental and developmental factors that can
be measured, giving a multiple regression.

The usual tests for the adequacy of a linear model must be
applied here; otherwise statistical tests of significance will not perform
correctly. That is, the observed distribution of residuals
should match the chosen theoretical distribution, residuals should
be uncorrelated with genotype, and residuals should not be correlated
between different individuals. In most genome-wide association
scans (GWASs), a simple regression of this type is the
workhorse, fitting one regression per measured locus. The quantile-quantile
(QQ) plot seen in almost every GWAS paper is one
graphical test that can highlight failures in the distributional
assumptions. For example, when fewer significant loci are detected
than would be expected by chance, this suggests model misspecification
(or widespread data mismeasurement); this will not be due
to confounding by population genetic structure, though it can
reflect batch effects in genotyping error rates if batches also differ
by phenotype [2]. A formal test for determining whether a mathematical
transformation of the trait values (e.g., log, square root) will
improve model adequacy is the Box-Cox maximum likelihood
approach [3].
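As a rough illustration of "one regression per measured locus", the following R sketch fits the basic model to simulated data and summarizes the resulting P-values with a genomic inflation factor. The data, the variable names, and the choice of age as the covariate are all assumptions made for the example, not part of any real study.

```r
# Simulated data; every name and value here is illustrative only.
set.seed(1)
n <- 500                      # individuals
m <- 200                      # SNPs
maf <- runif(m, 0.1, 0.5)     # minor allele frequencies
# Genotypes coded as minor-allele counts (0, 1, 2), one column per SNP
geno <- sapply(maf, function(p) rbinom(n, 2, p))
age <- rnorm(n, 50, 10)       # a measured covariate
# Trait influenced by SNP 1 and by age, plus Gaussian residual noise
y <- 0.5 * geno[, 1] + 0.02 * age + rnorm(n)

# One regression per locus: trait on allele dose, adjusting for age
pvals <- apply(geno, 2, function(g) {
  fit <- lm(y ~ g + age)
  summary(fit)$coefficients["g", "Pr(>|t|)"]
})

# Genomic inflation factor: median observed chi-square over its null median
chisq <- qchisq(pvals, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)

which.min(pvals)   # index of the most strongly associated SNP
```

A value of lambda well above 1 would suggest inflated test statistics, and a value well below 1 would suggest the kind of misspecification described above.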

The usual linear model is fitted via ordinary least squares
(OLS), which is equivalent to assuming a residual Gaussian distribution.
The far more flexible generalized linear model (GLM) [4]
incorporates a link function (a phenotype transformation) and a
variety of residual distributions in a maximum likelihood framework
fitted by iterative methods. The most familiar GLM is logistic
regression, which allows the basic model to be applied to binary
traits (most usually disease states in the human literature), but
GLMs extend to ordered and unordered (multi-)categorical traits,
survival times, and odd-shaped continuous distributions, the commonest
being Poisson, negative binomial, and Gamma distributions.
These models are all available using standard statistical
software.
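A minimal sketch of the GLM idea in base R, using simulated data: the same glm() call fits a logistic model for a binary trait or a Poisson model for counts, depending only on the family argument. The effect sizes and sample size are assumed values chosen for illustration.

```r
set.seed(2)
n <- 2000
g <- rbinom(n, 2, 0.3)                 # allele dose at one locus
# Binary trait: log-odds of disease rise by 0.4 per allele (assumed value)
disease <- rbinom(n, 1, plogis(-1 + 0.4 * g))

# Logistic regression: binomial residual distribution with a logit link
fit <- glm(disease ~ g, family = binomial(link = "logit"))
beta <- coef(fit)["g"]
exp(beta)                              # estimated per-allele odds ratio

# The same call fits other GLMs by changing the family, e.g. Poisson counts
counts <- rpois(n, exp(0.2 + 0.1 * g))
fit_pois <- glm(counts ~ g, family = poisson(link = "log"))
```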

The cause of a correlation between residuals and genotype that
most concerns geneticists is the existence of ethnic stratification,
where different subpopulations differ in genotype frequencies and
in phenotype distribution [2]. This is particularly important
because the effects of individual QTLs studied are generally small.
If the presence of this stratification in a study sample is not recognized,
combining the subpopulations in an analysis leads to spurious
QTL detection. A more extreme example is when the sample
contains close relatives. One approach to dealing with these phenomena
is to elaborate the linear model to allow correlated residuals,
a random effects model. Since we will retain the measured
genotypes and other covariates in the model, these are usually called
mixed models, because they contain both fixed and random effects
[5]. Random effects in genetic models are relatively unusual in that
we can strongly specify the correlations between the individuals
using either genetic theory (in the case of relatives we know these
from Mendel’s laws) or genomic sharing (an empirical or realized
relationship) based on genome-wide genotyping [2]. This model
can be written [1]:

$$y = Qg + Xb + Zu + e$$

where y, u, e are now vectors of values for the sample (of length n);
g the regression coefficient for each genotype class (e.g., a vector of
length 3 for a simple nucleotide variant); b the regression coefficients
for other measured covariates; u that portion of the residuals
that are correlated with one another; e the uncorrelated portion;
and Q and X are matrices indicating the genotype or covariate value
for each individual. The Z matrix maps individuals onto the u
values and will usually be an identity matrix for our purposes, but
can vary to handle monozygotic twins and repeated measures. For
the model to be fittable by iterative likelihood-based methods, we
must specify the correlation matrices for each random effect: for e,
an $n \times n$ diagonal matrix E (zero correlation between $e_i$ and $e_j$,
$i \neq j$), while for u, it is the known relatedness of individuals i and j,
an $n \times n$ matrix G estimated from pedigree and/or GWAS data
[6-8]. Members of the same subpopulation (or family) will have
similar u values, which will capture the relationship between ethnicity
and trait values that causes confounding in the simpler models.
Again a number of computer packages are now available for this
task [9-11], but the computational task is intensive (i.e., time
consuming) for the large datasets currently studied.
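To make the matrix G more concrete, here is a small R sketch of one common way to estimate an empirical relationship matrix from genome-wide genotypes: standardize each SNP by its sample allele frequency and average the cross-products over SNPs. This particular estimator and the simulated data are assumptions for illustration; dedicated packages offer several variants.

```r
set.seed(3)
n <- 100; m <- 2000
maf <- runif(m, 0.05, 0.5)
geno <- sapply(maf, function(p) rbinom(n, 2, p))   # n x m allele counts
# Make individual 2 a duplicate of individual 1 (like a monozygotic twin)
geno[2, ] <- geno[1, ]

# Drop monomorphic SNPs, then standardize each SNP by its sample frequency
p_hat <- colMeans(geno) / 2
keep <- p_hat > 0 & p_hat < 1
W <- sweep(geno[, keep], 2, 2 * p_hat[keep])
W <- sweep(W, 2, sqrt(2 * p_hat[keep] * (1 - p_hat[keep])), "/")

# Empirical relationship matrix: average cross-product over SNPs
G <- tcrossprod(W) / ncol(W)

G[1, 2]                                   # near 1 for the duplicated pair
unrel <- G[3:n, 3:n]
mean(unrel[upper.tri(unrel)])             # near 0 for unrelated pairs
```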

Fitting of a generalized linear mixed model to, for example, a
binary trait is even more intensive, and so is usually not feasible for
an entire modern GWAS. Most current published studies using
mixed models for binary traits treat them as if they are continuous
(coding them as 0 and 1), a practice that can only ever be approximately
correct (and which should be borne in mind when assessing
significance tests from these studies). Categorical traits can also be
analyzed using the same software, by creating dummy binary variables
that indicate membership of each category (dropping one
category to make the model identified). In practice, if the category
and genotype proportions are high enough, then results are reasonably
trustworthy.
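The dummy-variable coding just described can be sketched in base R as follows. The category labels and the choice of which category to drop are hypothetical.

```r
# Hypothetical four-category trait (labels are illustrative)
hair <- factor(c("red", "blond", "brown", "black", "brown", "red"),
               levels = c("black", "blond", "brown", "red"))

# One 0/1 indicator column per category (no intercept column)
dummies <- model.matrix(~ hair - 1)
colnames(dummies) <- levels(hair)

# Drop one category so the remaining indicators are not collinear
dummies3 <- dummies[, -1]
dummies3
```

Each row of the full indicator matrix sums to 1, which is why one column has to be dropped before the indicators are analyzed jointly.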

In the case that QTL genotypes are not directly measured,
these too can be treated as additional random effects. The appropriate
correlation matrix for this random effect is the empirical
relationship matrix for the genetic region of the QTL, that is, the
correlation matrix that measures the probability that alleles at the QTL in
a pair of individuals were inherited from the same ancestor (i.e., are
identical by descent). This is a variance components genetic linkage
analysis, and a likelihood ratio test can be used to determine linkage
of trait phenotypes to QTL genotypes. Statistical power of linkage
analysis in the natural pedigrees accessible for humans is much
lower than that of association analysis using measured QTL genotypes.
It does have the advantage that linkage of markers to the
QTL extends over much greater distances along the chromosome
than linkage disequilibrium, so that fewer markers need to be
genotyped. In addition, because linkage extends further, a less
severe correction for multiple testing is required, so that a linkage
test $P < 6 \times 10^{-5}$ is genome-wide significant with an experiment-wise
Type 1 error rate of ~0.05 [12]. For historical reasons, the
linkage test result is usually displayed as the decimal log likelihood
ratio (the lod), where the above P-value corresponds to a lod of 3.5.
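The correspondence between a lod score and a P-value can be checked with a short R calculation. The sketch assumes the likelihood ratio statistic is referred to a chi-square distribution with 1 degree of freedom, which reproduces the figures quoted above; other reference distributions are used in some settings and would give different values. The function names are made up for this example.

```r
# Convert a lod score to a likelihood-ratio chi-square and a P-value.
# Assumption: the statistic is referred to a chi-square with 1 df.
lod_to_p <- function(lod) {
  chisq <- 2 * log(10) * lod          # lod is a base-10 log likelihood ratio
  pchisq(chisq, df = 1, lower.tail = FALSE)
}
p_to_lod <- function(p) {
  qchisq(p, df = 1, lower.tail = FALSE) / (2 * log(10))
}

lod_to_p(3.5)    # about 6e-05
p_to_lod(6e-5)   # about 3.5
```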

More elaborate models that incorporate segregation and linkage
disequilibrium information from pedigrees can be fitted using
specialized software [13], but again these are most useful when the
effect size for the QTL is sizeable; examples would include the
ACE insertion-deletion polymorphism and serum ACE levels,
major “Mendelian” QTLs like those for familial hypercholesterolemia,
and many variants affecting gene expression (eQTLs). In
families segregating these variants, the trait distribution is often
obviously multimodal, each mode representing a genotype. There
is much current interest in incorporating linkage information in the
identification of rare causative variants in sequence data because, although
the information contribution from linkage analysis may not be
large, when combined with association and functional evidence it
may be decisive in choosing between several likely variants [14].

The same equations can be extended to multiple phenotypes 
simultaneously and to incorporate genotypes at multiple loci simultaneously.
In the case of multitrait analyses, one is able to estimate
the contribution of the QTL to the phenotypic correlation between 
pairs of traits. Furthermore, if traits are not strongly phenotypically 
correlated, and a QTL is pleiotropic in action, then this increases 
statistical power for gene detection [15]. 

When modeling multiple QTLs simultaneously, one often
encounters model identification problems due to having too few
data points per regression coefficient to be estimated. The first
simplification we can carry out is to model genotypic effects as a sum
of allelic effects (a regression of allele dose on genotype means).
This additive model is usually quite appropriate (it neglects dominance
effects, which are mostly small) and halves the parameter
count. In another direction, approaches similar in spirit to stepwise
regression attempt to reduce the number of loci in the model to a
core set of influential QTLs and can be fitted more easily. These can
be Bayesian in nature or various types of penalized regression [16].
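The saving in parameters from the additive model can be seen in a short R sketch on simulated data (the effect size and allele frequency are assumed values). The genotypic model spends two parameters on the locus, the additive model one, and because the models are nested an F test of the extra parameter is a test for dominance.

```r
set.seed(4)
n <- 1000
g <- rbinom(n, 2, 0.4)                 # allele dose (0, 1, 2)
y <- 0.3 * g + rnorm(n)                # purely additive simulated effect

fit_geno <- lm(y ~ factor(g))          # one mean per genotype: 2 effect terms
fit_add  <- lm(y ~ g)                  # regression on allele dose: 1 term

# Parameters spent on the locus under each model
length(coef(fit_geno)) - 1             # 2
length(coef(fit_add)) - 1              # 1

# F test of the extra (dominance) parameter; the additive model is nested
anova(fit_add, fit_geno)
```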

A related problem is the use of large-scale genotypic data to 
predict an individual phenotype value (genomic prediction)—this is 
of great interest to animal breeders, and in the populations they 
study can be quite precise. In human populations, such a prediction 
can be of interest for genetic epidemiology, notably when testing if 
QTLs for one trait can be used to predict a second trait (a test of
multilocus pleiotropy). A currently theoretical application is to use
the predicted value for a disease phenotype to tailor individual 
disease screening regimens or therapy. In the dosage polygenic 
risk score model, one fits an additive model to each locus in turn, 
and uses the resulting coefficients in a full prediction model using 
the observed genotypes. The assumption under this model is that 
there is no significant epistasis, that is, gene-by-gene interaction. 
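A minimal R sketch of the dosage polygenic risk score on simulated data follows: per-locus additive coefficients are estimated in a training sample and then applied to the genotypes of a separate test sample. Sample sizes, number of loci, and effect sizes are assumptions for illustration, and the simulated trait is strictly additive, in line with the no-epistasis assumption.

```r
set.seed(5)
n_train <- 1000; n_test <- 500; m <- 20
maf <- runif(m, 0.1, 0.5)
beta_true <- rnorm(m, 0, 0.2)          # simulated per-allele effects
sim_geno <- function(n) sapply(maf, function(p) rbinom(n, 2, p))

geno_train <- sim_geno(n_train)
geno_test  <- sim_geno(n_test)
y_train <- drop(geno_train %*% beta_true) + rnorm(n_train)
y_test  <- drop(geno_test  %*% beta_true) + rnorm(n_test)

# Step 1: fit an additive model to each locus in turn (training sample)
beta_hat <- apply(geno_train, 2, function(g) coef(lm(y_train ~ g))["g"])

# Step 2: score = sum over loci of allele dose times estimated coefficient
score_test <- drop(geno_test %*% beta_hat)

cor(score_test, y_test)                # predictive correlation in new data
```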


2 A Standard Workflow for QTL Association Analysis of a GWAS 

Almost all the computer programs suggested here are run from the 
command line, usually in Unix or related operating systems such as 
OSX, although many will run under Windows (see Note 1). Most 
require a lot of memory. GWAS analyses are usually best run on 
computer clusters, as the analyses can be parallelized easily by 
dividing up data by chromosome or chromosome chunk. 

Determine an appropriate transformation of the phenotype 
under study. Usually, information will be available in the literature, 
for example, log transformation of body mass index or serum 
cholesterol levels. In the case of eQTLs, quite elaborate transformation
(normalization) is essential. The Box-Cox procedure can
be used to confirm or select an optimal transform, ideally including 
a number of known covariates in the regression. Alternatively, a 
rankit transformation is not uncommonly used, where each value is 
replaced with the normal score corresponding to the same rank in 
the observed distribution. This preliminary is not specific to genetic 
analysis but is an important first step. 
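Both transformations can be sketched in base R. The Box-Cox part below evaluates the profile log-likelihood over a grid of transformation parameters with a covariate in the regression, and the rankit part uses one common definition of the normal score. The simulated trait, the grid, and the helper names are assumptions for the example.

```r
set.seed(6)
n <- 2000
age <- rnorm(n, 50, 10)
# Simulated right-skewed trait that is Gaussian on the log scale
y <- exp(1 + 0.01 * age + rnorm(n, sd = 0.5))

# Profile log-likelihood of the Box-Cox parameter, given covariates
boxcox_loglik <- function(lambda, y, X) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(z ~ X))^2)
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}
grid <- seq(-2, 2, by = 0.05)
ll <- sapply(grid, boxcox_loglik, y = y, X = age)
lambda_hat <- grid[which.max(ll)]      # a value near 0 points to a log transform

# Rankit transform: replace each value by the normal score of its rank
rankit <- function(x) qnorm((rank(x) - 0.5) / length(x))
y_rankit <- rankit(y)
```

A maximizing value near 1 would indicate that no transformation is needed, and a value near 0.5 would point to a square root.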

If the dataset includes closely related individuals, carry out a 
purely family based phenometric mixed model analysis, using the 
kinship matrix based on the known pedigree. This gives an estimate 
of the trait heritability. This can point to errors, and if enough data 
is available, gives an upper bound to combined effects of the QTLs. 
Most mixed model packages will allow one to enter a pedigree, 
from which the appropriate correlation matrix (often the matrix 
inverse, as this saves some calculation) will be calculated for this 
analysis, e.g., polygenic() and polygenic_hglm() in the R GenABEL
package [17], the R MCMCglmm package [18], MERLIN [19],
MENDEL [13], or SOLAR [20]. 
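For readers who want to see what such packages compute from a pedigree, here is a small base R sketch of the standard recursive calculation of kinship coefficients. The pedigree and the function name are hypothetical, and the sketch assumes that parents are listed before their offspring and that 0 denotes an unknown parent.

```r
# Hypothetical pedigree: two founders (1, 2), their children (3, 4),
# an unrelated founder (5), and the child (6) of 4 and 5.
# Parents must be listed before their offspring; 0 marks an unknown parent.
ped <- data.frame(id     = 1:6,
                  father = c(0, 0, 1, 1, 0, 4),
                  mother = c(0, 0, 2, 2, 0, 5))

kinship_matrix <- function(ped) {
  n <- nrow(ped)
  K <- matrix(0, n, n, dimnames = list(ped$id, ped$id))
  for (i in seq_len(n)) {
    f <- match(ped$father[i], ped$id)   # row of father, NA if unknown
    m <- match(ped$mother[i], ped$id)   # row of mother, NA if unknown
    for (j in seq_len(i - 1)) {
      # Kinship with an earlier individual: average over i's two parents
      kf <- if (is.na(f)) 0 else K[f, j]
      km <- if (is.na(m)) 0 else K[m, j]
      K[i, j] <- K[j, i] <- (kf + km) / 2
    }
    # Self-kinship: 1/2 plus half the kinship between the parents
    kfm <- if (is.na(f) || is.na(m)) 0 else K[f, m]
    K[i, i] <- 0.5 + kfm / 2
  }
  K
}

K <- kinship_matrix(ped)
2 * K     # expected additive relationship matrix (twice the kinship)
```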

Carry out appropriate cleaning of the available genotype data
(as discussed in Chapter 9 in this volume). Imputation of unobserved
genotypes has the advantage of increasing statistical power
by ~10 % and improving localization of the true QTL, as opposed
to nearby variants in linkage disequilibrium. Principal components
for study participant genotypes can be calculated to use as covariates
that offer another way of controlling effects of ethnic stratification.
Genotype data can be stored as PLINK or compressed VCF
files. The PLINK2 program is one package that can convert backwards
and forwards between these formats (https://www.cog-genomics.org/plink2/).
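The calculation of principal components from genotypes can be sketched with base R on a simulated sample made up of two groups with different allele frequencies. The group structure, sample sizes, and the number of components retained are assumptions for illustration; real analyses typically use dedicated software and LD-pruned markers.

```r
set.seed(7)
n_per <- 100; m <- 1000
p_a <- runif(m, 0.2, 0.8)                               # allele freqs, group A
p_b <- pmin(pmax(p_a + rnorm(m, sd = 0.2), 0.05), 0.95) # shifted in group B
geno <- rbind(sapply(p_a, function(p) rbinom(n_per, 2, p)),
              sapply(p_b, function(p) rbinom(n_per, 2, p)))
group <- rep(c("A", "B"), each = n_per)

# Keep polymorphic SNPs, then center and scale each SNP before PCA
geno <- geno[, apply(geno, 2, var) > 0]
pca <- prcomp(geno, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:10]          # top components, for use as covariates

# The leading component should separate the two simulated groups
tapply(pcs[, 1], group, mean)
```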
There is a large and increasing number of computer programs
that fit QTL models. The GEMMA package (http://www.xzlab.org/software.html)
[10, 21] is a (relatively) friendly and fast program
for estimating empirical kinship matrices and performing
linear mixed model association analysis. It also offers a Bayesian
sparse linear mixed model approach to reducing the number of
QTLs in the model to a manageable number. It reads a PLINK
(or a BIMBAM) genotype file and a plain text phenotype file. An
initial run calculates the empirical relationship matrix, which can
then be used in subsequent mixed model fitting runs. Here is
output from the calculation of the relationship matrix (based on
PLINK format data files colour4.fam, colour4.bim,
colour4.bed, and phenotype file dummy4.dat):

## GEMMA Version = 0.93
##
## Command Line Input = -b colour4 -p dummy4.dat -gk -miss 0.5 -o GABC
##
## Summary Statistics:
## number of total individuals = 2190
## number of analyzed individuals = 2190
## number of covariates = 1
## number of total SNPs = 911857
## number of analyzed SNPs = 811485
##
## Computation Time:
## total computation time = 54.8707 min
## computation time break down:
## time on calculating relatedness matrix = 54.395 min

Since a small correlation must exist between overall relationship and
resemblance at a QTL, it is now usual to produce a relationship
matrix excluding the chromosome of the genotype currently being
tested for association; this slightly increases power to detect QTLs
[22]. Below are the results from testing for QTLs for human hair color
treated as four unordered categories using GEMMA, where the -n
option is used to select 3 of the 4 dummy variables (cols 1, 2, and
4) encoding these in the file dummy4.dat:

##
## GEMMA Version = 0.94beta
##
## Command Line Input = -b colour4 -p dummy4.dat -n 1 2 4 -k GABC.cXX.txt -lmm -miss 0.5 -o hc4
##
## Summary Statistics:
## number of total individuals = 2190
## number of analyzed individuals = 1179
## number of covariates = 1
## number of phenotypes = 3
## number of total SNPs = 1109421
## number of analyzed SNPs = 810682
## REMLE log-likelihood in the null model = 183.076
## MLE log-likelihood in the null model = 185.732
## REMLE estimate for Vg in the null model:
0.0267261
0.000351117  0.187691
-0.00381588  -0.0265142  0.12686
## se(Vg):
0.00581771
0.00906147  0.0289283
0.00429616  0.0102545  0.00876679

followed by results for known hair color loci, processed using the
postgwas R package [23], which automatically recognizes GEMMA
output files (Table 1).

The R commands which automatically generated plots and the 
previous table were as follows: 

library(postgwas)
results = read.delim("hc4.assoc.txt")
postgwas(results)
quit(save="no")


Table 1
Significant associations for known hair color loci

SNP         CHR  BP        p        Gene name
rs35395     5    33948589  3.6e-08  SLC45A2
rs12203592  6    396321    2.8e-11  IRF4
rs12913832  15   28365618  2.0e-31  HERC2
rs1805008   16   89986144  6.2e-21  MC1R







3 A Standard Workflow for QTL Linkage Analysis 

I will not specifically discuss experimental breeding studies, where particular tailored methods can take advantage of the design (e.g., the R/QTL package [24]), but the general approaches described below can be applied to such data.

QTL linkage analysis of natural pedigrees is now usually performed where SNP genotype data has been collected, though microsatellite or RAPD markers may still be used in nonhuman pedigrees. Specialized linkage SNP arrays with 7000-11,000 markers are still available from some suppliers, but the fall in cost of denser SNP panels and of sequencing has made these less attractive. Since linkage extends over long genetic distances, and linkage disequilibrium actually makes linkage analysis more technically difficult, the data are first thinned to retain ~10,000 informative SNPs in linkage equilibrium with one another. Marker informativeness for linkage is proportional to the number of heterozygous genotypes in the pedigrees to be analyzed (i.e., common SNPs are more useful). SNP genotype data is cleaned as for association analysis (see Chapter 27). Because pedigree data are available, testing for genotype errors can also take advantage of Mendelian error detection: genotypes of parents and offspring must be consistent at each locus, and apparent recombination rates between neighboring loci must be plausible (close double recombinants are very likely to represent a genotyping error). Similarly, the pedigree structure must be checked for nonpaternity and sample mix-up by comparing the empirical kinship matrix to the expected kinship matrix based on the reported pedigree.
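
The per-locus parent-offspring consistency rule can be illustrated with a minimal Python sketch. The function name, the genotype encoding (pairs of allele codes, with 0 for a missing allele), and the toy trio are assumptions made for illustration; this is not the algorithm used by MERLIN or any other package, and it only covers autosomal markers with both parents typed.

```python
from itertools import product


def mendel_consistent(father, mother, child):
    """Return True if the child's unordered genotype can be formed from one
    allele of each parent. Genotypes are 2-tuples of allele codes; any
    genotype containing 0 is treated as missing and is never flagged."""
    if 0 in father + mother + child:
        return True
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# Toy trio at three markers: (father, mother, child); values are made up.
trio = {
    "m1": ((1, 3), (1, 1), (1, 3)),  # consistent
    "m2": ((1, 1), (1, 1), (1, 2)),  # allele 2 is absent from both parents
    "m3": ((4, 4), (0, 0), (4, 4)),  # mother untyped, so not checked here
}
errors = [m for m, (f, mo, c) in trio.items() if not mendel_consistent(f, mo, c)]
print(errors)
```

Markers with an untyped parent are simply passed by this sketch; pedigree-aware software may still be able to detect some of those errors from siblings or more distant relatives.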

Parametric QTL linkage analysis requires one to estimate or specify a model for the relationship between QTL genotype and the trait. If the QTL effect is large, this can be estimated using segregation analysis of pedigree data. The “nonparametric” variance components linkage (i.e., mixed model) approach does not require this step, and we concentrate on that approach below.

The MERLIN program (http://csg.sph.umich.edu/abecasis/Merlin/index.html) [19] supports both parametric and nonparametric linkage analysis, association analysis, and Mendelian error detection. It also imputes missing genotypes utilizing the pedigree structure. Three data files must be prepared (Table 2): one listing the traits and markers (“.dat” file), a list of marker genetic map positions in sex-averaged centiMorgans (“.map” file), and the parents and actual genotype and phenotype data for each individual (“.ped” file). The first five fields of the pedigree file are pedigree ID, individual ID, father ID, mother ID, and sex. Genetic map positions for SNPs can be estimated from files under http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/, which give sequence coordinates (bp) and genetic map positions (cM) along each chromosome. These MERLIN file formats differ only slightly from the PLINK2 “.ped” and “.map” file formats (the latter has fields for both sequence and genetic map positions).

Table 2
Input files required by MERLIN

.dat file
  T trait1
  C covariate1
  C covariate2
  M rs9579484
  M rs9552488
  M rs9510743

.map file
  CHR  MARKER     POS
  13   rs9579484  0
  13   rs9552488  0.021786
  13   rs9510743  4.99304

.ped file
  Ped1 id1 0   0   m 12.1 1.1 1 3/1 1/1 4/4
  Ped1 id2 0   0   f 14.4 2.3 2 0/0 0/0 0/0
  Ped1 id3 id1 id2 f 9.1  5.3 1 1/3 1/1 4/4
  Ped1 id4 id1 id2 f 16.2 2.7 1 1/3 1/1 4/4
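
Estimating a SNP's genetic map position from a recombination map usually amounts to interpolating between flanking reference positions. The Python sketch below uses simple linear interpolation; the function name and the reference values are hypothetical, and a real map file would first have to be parsed into lists of bp and cM values sorted by position.

```python
from bisect import bisect_right


def interpolate_cm(bp, ref_bp, ref_cm):
    """Linearly interpolate a genetic position (cM) for a physical position
    (bp) from a reference map sorted by ascending bp. Positions outside the
    reference range are clamped to the first or last reference value."""
    if bp <= ref_bp[0]:
        return ref_cm[0]
    if bp >= ref_bp[-1]:
        return ref_cm[-1]
    hi = bisect_right(ref_bp, bp)  # first reference point strictly right of bp
    lo = hi - 1
    frac = (bp - ref_bp[lo]) / (ref_bp[hi] - ref_bp[lo])
    return ref_cm[lo] + frac * (ref_cm[hi] - ref_cm[lo])


# Made-up reference points, not taken from any real map file.
ref_bp = [1_000_000, 2_000_000, 4_000_000]
ref_cm = [0.0, 1.5, 2.5]
for position in (1_500_000, 3_000_000, 5_000_000):
    print(position, interpolate_cm(position, ref_bp, ref_cm))
```

Clamping outside the reference range is one possible choice; extrapolating from the nearest interval or dropping such markers would be equally defensible.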

The MERLIN-associated pedstats program produces summary statistics and checks the pedigree file for errors. For example, every included individual must either have both parents specified, each with its own record in the .ped file, or have neither parent specified (where the individual is a pedigree founder). It is called from the command line as follows:

pedstats -d example.dat -p example.ped
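
The parent-specification rule just described can be written down in a few lines of Python. This is only an illustration of the rule under assumed conventions (the function name is hypothetical, and "0" is taken to mark an unknown parent, as in Table 2); it does not reproduce what pedstats itself does.

```python
def check_pedigree(rows):
    """rows: (pedigree, individual, father, mother, sex) tuples, with "0"
    marking an unknown parent. Returns a list of problem descriptions."""
    known = {(ped, ind) for ped, ind, *_ in rows}
    problems = []
    for ped, ind, father, mother, _sex in rows:
        # A founder has neither parent; everyone else needs both.
        if (father == "0") != (mother == "0"):
            problems.append(f"{ped}/{ind}: only one parent specified")
        # Every named parent needs a record in the same pedigree.
        for parent in (father, mother):
            if parent != "0" and (ped, parent) not in known:
                problems.append(f"{ped}/{ind}: parent {parent} has no record")
    return problems


rows = [
    ("Ped1", "id1", "0", "0", "m"),
    ("Ped1", "id2", "0", "0", "f"),
    ("Ped1", "id3", "id1", "id2", "f"),
    ("Ped1", "id4", "id1", "0", "f"),    # mother missing
    ("Ped1", "id5", "id9", "id2", "m"),  # father id9 has no record
]
problems = check_pedigree(rows)
for line in problems:
    print(line)
```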

MERLIN's Mendelian error checking is invoked, as are all the other analyses, by using a command line flag:

merlin -d example.dat -m example.map -p example.ped --errors

This produces a list of putative errors and a P-value in a file merlin.err. Another program, pedwipe, can then delete these automatically. The resulting “wiped.ped” file can then be analyzed. For example,

merlin -d skincol.dat -m skincol.map -p skincol.ped --assoc

gives the following output for an analysis of the region around the SLC45A2 gene associated with human pigmentation (as evidenced for hair color in the previous section), examining skin color measured as an ordinal trait (the “--assoc” option automatically performs linkage and association analyses, while the “--vc” option gives linkage only):



Phenotype: skincol (ASSOC) (503 families, h2 = 100.00%)

Fig. 1 Linkage and association results for a region of human chromosome 5 around the SLC45A2 gene (x-axis: chromosome 5 position in cM)

Extracting and plotting the positions and P-values from the above output (see Fig. 1) shows again that the linkage and association signals peak very close together, in both cases exceeding the threshold for genome-wide significance.


4 Note 


1. There are a large number of computer packages for QTL analysis, and the number continues to increase. For example, our group has recently moved from carrying out QTL association analysis using MERLIN to RAREMETALWORKER [25], which uses the same algorithms as MERLIN, but reads VCF files natively and uses very little memory. Within the R statistical computing environment [26] alone, there are the bqtl, boss, BayHap, dlmap, eqtl, gap, GenABEL, hapassoc, haplo.stats, ibdreg, ldlasso, JAGUAR, MatrixEQTL, mqtl, multic, qtl (R/QTL), qtlhot, qtlmt, qtlnet, QTLRel, SNPassoc, snpstats, strum, wgaim, and WQ2 packages. A certain amount of functionality is common, so one should expect consistent answers from different packages [27], but I routinely perform duplicate analyses using different packages, at least for genomic regions containing significant QTLs. In my own GWAS analyses, I usually deal with closely related individuals (either from nuclear or large extended pedigrees), and so the common approach of removing relatives from the dataset is not feasible. This is one reason I have concentrated in this review on association methods using empirical kinship matrices. Unfortunately, these matrices can be estimated using different approaches, which sometimes causes test results for a given locus to fluctuate significantly. Several recent papers have explored methods of stabilizing this effect by shrinkage estimation, regularization, or combination of multiple estimates (e.g., pedigree and GWAS based) [28, 29].

Chapter 12


High-Dimensional Profiling for Computational Diagnosis 

Claudio Lottaz, Wolfram Gronwald, Rainer Spang, and Julia C. Engelmann 


Abstract 

New technologies allow for high-dimensional profiling of patients. For instance, genome-wide gene expression analysis in tumors or in blood is feasible with microarrays, if all transcripts are known, or even without this restriction using high-throughput RNA sequencing. Other technologies like NMR fingerprinting allow for high-dimensional profiling of metabolites in blood or urine. Such technologies for high-dimensional patient profiling represent novel possibilities for molecular diagnostics. In clinical profiling studies, researchers aim to predict disease type, survival, or treatment response for new patients using high-dimensional profiles. In this process, they encounter a series of obstacles and pitfalls. We review fundamental issues from machine learning and recommend a procedure for the computational aspects of a clinical profiling study.

Keywords Microarrays, Gene expression profiles, RNA sequencing, Metabolite analysis, NMR fingerprinting, Statistical classification, Supervised machine learning, Feature selection, Model assessment


1 Introduction 


In clinical microarray studies, tissue samples from patients are examined using microarrays measuring gene expression levels of as many as 50,000 transcripts. Such high-dimensional data, possibly complemented by additional information about patients, provide novel opportunities for molecular diagnostics through automatic classification.

For instance, Roepman et al. [1] describe a study on head and neck squamous cell carcinomas. In this disease, treatment strongly depends on the presence of metastases in lymph nodes near the neck. However, diagnosis of metastases is difficult. With standard diagnosis, more than half of the patients undergo surgery unnecessarily, while 23% remain undertreated. Roepman et al. show that treatment based on microarray prediction can be significantly more accurate: in a validation cohort, undertreatment would have been completely avoided while the rate of unnecessary surgery would have dropped from 50% to 14%.


Jonathan M. Keith (ed.), Bioinformatics: Volume II: Structure, Function, and Applications, Methods in Molecular Biology,
vol. 1526, DOI 10.1007/978-1-4939-6613-4_12, ©Springer Science+Business Media New York 2017

From a statistical point of view, the major characteristic of microarray studies is that the number of genes is orders of magnitude larger than the number of patients. For classification this leads to problems involving overfitting and saturated models. When classification algorithms are applied blindly, a model adapts to noise in the data rather than to the molecular characteristics of a disease. Thus, the challenge is to find molecular classification signatures that are valid for entire disease populations.

In the following, we briefly describe the machine learning theory, as it is needed for computational diagnostics using high-dimensional data. We suggest software solutions in Subheading 2 and a procedure for a clinical profiling study in Subheading 3. In the last section, we mention alternative data sources for high-dimensional data, point out some alternative classification and evaluation strategies, and describe pitfalls in the analysis and interpretation of high-dimensional clinical studies.

1.1 The Classification Problem

Classification is a well-investigated problem in machine learning. This chapter gives a brief overview of the most fundamental issues, bearing microarray gene expression analysis in mind as a prominent example for the measurement of high-dimensional data. We refer to [2-6] for further reading on the more theoretical concepts of machine learning and to [7, 8] for an in-depth description of their application to microarray data.

The task in classification is to determine classification rules 
which enable discrimination between two or more classes of objects 
based on a set of features. Supervised learning methods construct 
classification rules based on training data with known classes. They 
deduce rules by optimizing a classification quality criterion, for 
example, by minimizing the number of misclassifications. In 
microarray-based computational diagnostics, features are gene 
expression levels, objects are tissue samples from patients, and 
object classes are phenotypic characteristics of patients. Phenotypes 
can include previously defined disease entities, as in the Microarray 
Innovations in LEukemia study [9] and lymphoma-related studies 
[10, 11]; risk groups, as in the breast cancer studies of van’t Veer 
et al. [12]; treatment response, as in the leukemia study by Cheok 
et al. [13]; or disease outcome, as in the breast cancer study of West 
et al. [14]. In this context, classification rules are called diagnostic 
signatures. 

Study cases are always samples from a larger disease population. Such a population comprises all patients who had a certain disease, have it now, or will have it in the future. We aim for a diagnostic signature with good performance not only on the patients in the study but also in future clinical practice. That is, we aim for a classification rule that generalizes beyond the training set. Different learning algorithms determine signatures of different complexity. We illustrate signature complexity using a toy example in which diagnosis of treatment response is based on the expression levels of only two genes (Fig. 1). A linear signature corresponds to a straight line, which separates the space defined by the two genes into two parts, holding good and bad responders, respectively. When expression levels of many genes are measured, linear signatures correspond to hyperplanes separating a high-dimensional space into two parts. Other learning algorithms determine more complex boundaries. In microarray classification, however, the improvement achieved by sophisticated learning algorithms is controversial [15]. Complex models adapt more flexibly to the data. However, they also adapt to noise more easily and may thus miss the characteristic features of the disease population. Consider Fig. 1, where black data points represent bad responders and white data points represent good responders. A linear signature is shown in the left panel of Fig. 1, while a complex boundary is drawn in the right panel. The linear signature reflects the general tendency of the data but is not able to classify perfectly. On the other hand, the complex boundary never misclassifies a sample, but it does not appear to be well supported by the data. When applying both signatures to new patients, it is not clear whether the complex boundary will outperform the linear signature. In fact, experience shows that complex signatures often do not generalize well to new data, and hence break down in clinical practice. This phenomenon is called overfitting.

Fig. 1 Overfitting. The linear boundary (left-hand side) reflects characteristics of the disease population, including an overlap of the two classes. The complex boundary (right-hand side) does not show any misclassifications but adapts to noise. It is not expected to perform well on future patient data

1.2 The Curse of Dimensionality

In microarray studies, the number of genes is always orders of magnitude larger than the number of patients. In this situation, overfitting also occurs with linear signatures. To illustrate why this is a problem, we use another simplistic toy example: two genes are measured in only two patients. This is the simplest scenario where the number of patients in a study does not exceed the number of genes. We want to construct a linear signature that can discriminate between the two classes, say good and bad responders, each represented by one patient. This is the same problem as finding a straight line separating two points in a plane. Clearly, there is no unique solution (see Fig. 2). Next, think about a third point, which does not lie on the line going through the first two points. Imagine it represents a new patient with unknown diagnosis. The dilemma is that it is always possible to linearly separate the first two points such that the new one lies on the same side as either one of them. No matter where the third point lies, there is always one signature with zero training error, which classifies the new patient as a good responder, and a second signature, equally well supported by the training data, which classifies the new patient as a bad responder. The two training patients do not contain sufficient information to diagnose the new patient. We are in this situation whenever there are at least as many genes as training patients. Due to the large number of genes on microarrays this problem is inherent in gene expression studies.

Fig. 2 Ill-posed classification problem. Even linear signatures are underdetermined when the number of patients does not exceed the number of genes. The black and white data points represent the training data. The linear signatures in both panels are equally valid but classify the novel data point (gray) differently
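
The two-patient dilemma can be reproduced numerically. In the Python sketch below, the coordinates, the two weight vectors, and all names are invented for illustration; any other pair of lines separating the two training points on opposite sides of the new point would make the same case.

```python
def classify(w, b, x):
    """Linear signature: +1 (good responder) if w.x + b > 0, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


# Two genes measured in two training patients, one per class.
train = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]
# A new patient who does not lie on the line through the training points.
new_patient = (0.0, 2.0)

signatures = {
    "A": ((1.0, 1.0), 0.0),   # weight vector and offset
    "B": ((1.0, -1.0), 0.0),
}

results = {}
for name, (w, b) in signatures.items():
    training_errors = sum(classify(w, b, x) != y for x, y in train)
    results[name] = (training_errors, classify(w, b, new_patient))
    print(name, "training errors:", training_errors,
          "new patient:", results[name][1])
```

Both signatures have zero training error, yet they assign the new patient to opposite classes, which is exactly the underdetermination shown in Fig. 2.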

The way out of the dilemma is regularization. Generally speaking, regularization means imposing additional criteria in the signature building process. A prominent method of regularization is gene selection, which restricts the number of genes contributing to the signature. This can be biologically motivated, since not all genes carry information on disease states. In many cases, gene selection improves the predictive performance of a signature [16]. Furthermore, adequate selection of genes opens the opportunity to design smaller and hence cheaper diagnostic microarrays or marker panels [17].

Often genes are selected independently of the classification algorithm according to univariate selection criteria. This is called gene filtering [18]. In the context of classification, selection criteria reflecting the correlation with the class labels are generally used. Hence one favors genes with low expression levels in one class and high expression levels in the other. Popular choices are variants of the t-statistic or the nonparametric Wilcoxon rank sum statistic.

1.3 Calibrating Model Complexity

As illustrated in the previous sections, regularization is essential in the process of determining a good diagnostic signature. Aggressive regularization, however, can be as harmful as too little of it. Hence regularization needs to be calibrated. One way to do so is to vary the number of genes included in the signature: weak regularization means that most genes are kept, whereas strong regularization removes most genes.

With little regularization, classification algorithms fit very flexible decision boundaries to the data. This results in few misclassifications on the training data. Nevertheless, due to overfitting, this approach can have poor predictive performance in clinical practice. With too much regularization, the resulting signatures are too restricted. They perform poorly both on the study patients and in future clinical practice. We refer to this situation as underfitting. The problem is schematically illustrated in Fig. 3, where two error rates are compared. First there is the training error, which is the number of misclassifications observed on data from which the signature was learned. In addition, there is a test error, which is observed on an independent test set of additional patients randomly drawn from the disease population. Learning algorithms minimize the training error, but the test error measures whether a signature generalizes well.



Fig. 3 Overfitting-underfitting trade-off. The x-axis codes for model complexity (low to high) and the y-axis for error rates. The dashed line displays the training error, the solid line the test error. Low complexity models (strong regularization) produce high test errors (underfitting) and so do highly complex models (weak regularization, overfitting)

In order to learn signatures that generalize well, we adapt model complexity so that test errors are minimized. To this end, we have to estimate test errors on an independent set of patients, the calibration set, which is not used in signature learning. To calibrate regularization, we learn signatures of varying complexity, evaluate them on the calibration set, and pick the signature that performs best.

1.4 Evaluation of Diagnostic Signatures

Validation of a diagnostic signature is important, because the errors on a training set do not reflect the expected error in clinical practice. In fact, the validation step is most critical in computational diagnosis studies and several pitfalls are involved. Estimators can be overly optimistic (biased) or they can have high sample variances. It also makes a difference whether we are interested in the performance of a fixed signature (which is usually the case in clinical studies), or whether we are interested in the power of the learning algorithm that builds the signatures (which is usually the case in methodological projects). The performance of a fixed signature varies due to the random sampling of the test set, while the performance of a learning algorithm varies due to sampling of both training and test set.

In computational diagnostics, we are usually more interested in the evaluation of a fixed diagnostic signature. The corresponding theoretical concept is the conditional error rate, also called true error. It is defined as the probability of misclassifying new patients given the training data used in the study. The true error is not obtainable in practice, since its computation involves the unknown population distribution. Estimates of the true error rate, however, can be obtained by evaluating the diagnostic signature on independent test samples.

2 Software

Implementations and corresponding documentation for most computational methods we describe here can be found in packages contributed to the statistical computing environment R [19, 20] and are available from http://cran.R-project.org. In addition, the Bioconductor project [21] containing many R-packages related to the life sciences is found at the link http://www.bioconductor.org.

2.1 Classification Software

LDA, QDA, and many other classification methods are implemented in the R-package MASS from the VR bundle; DLDA is contained in the R-package supclust. An implementation of support vector machines (SVMs) is part of the package e1071; svmpath can compute the entire regularization path for SVMs. The nearest centroid method is implemented in the Bioconductor package pamr. The package MCRestimate from the Bioconductor project implements many helpful functions for nested cross-validation.

2.2 Additional Software Related to RNA Microarrays

The Bioconductor package oligo or aroma.affymetrix can be used to generate logarithmic data on an additive scale from microarray measurements. These packages focus on measurements from Affymetrix microarrays and can also be used to bring patient profiles to a common scale. A normalization alternative is provided in the Bioconductor package vsn.

2.3 Additional Software Related to RNA Sequencing

TopHat2 is one of many suggested tools independent of R to map RNA sequence reads to genomes or transcriptomes for generating count data. Bioconductor packages edgeR and DESeq2 can be used to derive normalized expression values from RNA-seq count data. The R package PoiClaClu implements sparse Poisson Linear Discriminant Analysis (sPLDA) and is available from CRAN.

2.4 Additional Software Related to NMR Metabolite Measurements

For the analysis and binning of NMR spectra, the software package AMIX (Bruker Biospin GmbH, Rheinstetten, Germany) may be used. Normalization methods developed for microarrays can also normalize data from NMR bins. Suitable choices are variance stabilization and normalization (from the package vsn) or quantile normalization (from the package limma). In metabolite classification, random forests and support vector machines as implemented in the R-packages randomForest and e1071, respectively, performed particularly well.

3 Methods

Here we describe our choice of methods for the development of a diagnostic signature when data is generated using microarrays. The suggested methods, however, can be applied in the same way to other high-dimensional data (see Note 1). In addition to normalized expression profiles, patients have an attributed class label reflecting a clinical phenotype. The challenge is to learn diagnostic signatures on gene expression data that enable prediction of the correct clinical phenotype for new patients.

3.1 Notation

We measure $p$ genes on $n$ patients. The data from the microarray corresponding to patient $i$ is represented by the expression profile $x^{(i)} \in \mathbb{R}^p$. We denote the label indicating the phenotype of patient $i$ by $y_i \in \{-1, +1\}$. In this setting, the class labels are binary and we restrict our discussion to this case. However, the discussed methods can be extended to multiclass problems dealing with more than two clinical phenotypes. The profiles are arranged as rows in a gene expression matrix $X \in \mathbb{R}^{n \times p}$. All labels together form the vector $y = (y_i)_{i=1,\ldots,n}$. The pair $(X, y)$ is called a dataset $D$. It holds all data of a study in pairs of observations $\{(x^{(i)}, y_i)\}$, $i = 1, \ldots, n$. The computational task is to generate a mathematical model $f$ relating $x$ to $y$.

Although we never have access to the complete disease population, it is convenient to make it part of the mathematical formalism. We assume that there is a data-generating distribution $P(X, Y)$ on $\mathbb{R}^p \times \{-1, +1\}$. $P(X, Y)$ is the joint distribution of expression profiles and associated clinical phenotypes. The patients who enrolled for the study, as well as new patients who need to be diagnosed in clinical practice, are modeled as independent samples $\{(x^{(i)}, y_i)\}$ drawn from $P$. We denote a diagnostic signature, or classification rule, by $f : \mathbb{R}^p \to \{-1, +1\}$.

3.2 Diagonal Linear Discriminant Analysis (DLDA)

Various learning algorithms have been suggested for microarray analysis. Many of them implement rather sophisticated approaches to model the training data. However, Wessels et al. [15] report that in a comparison of the most popular algorithms and gene selection methods, simple algorithms perform particularly well. They have assessed the performance of learning algorithms using six datasets from clinical microarray studies. Diagonal linear discriminant analysis (DLDA) combined with univariate gene selection achieved very good results. This finding is in accordance with those of other authors (e.g., [16]). Thus, we recommend and describe this combination of methods.

Diagonal linear discriminant analysis (DLDA) is based on a comparison of multivariate Gaussian likelihoods for two classes. The conditional density $P(x \mid y = c)$ of the data given membership to class $c \in \{-1, +1\}$ is modeled as a multivariate normal distribution $N(\mu^c, \Sigma^c)$ with class mean $\mu^c$ and covariance matrix $\Sigma^c$. The two means and covariance matrices are estimated from the training data. A new point is classified to the class with higher likelihood. Restrictions on the form of the covariance matrix control model complexity: quadratic discriminant analysis (QDA) allows different covariance matrices in both classes; linear discriminant analysis (LDA) assumes that they are the same, and diagonal linear discriminant analysis (DLDA) additionally restricts the covariance matrix to diagonal form. The parameters needed by DLDA are therefore class-wise mean expression values for each gene and one pooled variance per gene.

To derive the decision rule applied in DLDA, consider the multivariate Gaussian log-likelihood $N(\mu^c, \Sigma)$ for class $c$ with diagonal covariance matrix $\Sigma$. The vector $\mu^c$ contains the mean gene expression values $\mu_i^c$ for class $c$. It is also called the class centroid. The covariance matrix $\Sigma$ contains pooled gene-wise variances $\sigma_i^2$. The log-likelihood $L_c(x)$ can then be written in the form:

$$L_c(x) = -\frac{1}{2}\sum_{i=1}^{p}\log\left(2\pi\sigma_i^2\right) - \frac{1}{2}\sum_{i=1}^{p}\frac{\left(x_i - \mu_i^c\right)^2}{\sigma_i^2}.$$

The first term of $L_c(x)$ does not depend on the class and is therefore neglected in DLDA. DLDA places patients into the class for which the absolute value of the second term is minimized.
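
The decision rule can be sketched in plain Python as follows. The function names and the toy profiles are invented for illustration, and the pooled variance is computed here with an n - 2 denominator, which is an assumption rather than something the text specifies; the R-package supclust mentioned in Subheading 2.1 would be the practical choice.

```python
def fit_dlda(X, y):
    """Estimate class centroids and pooled gene-wise variances.
    X: list of expression profiles (one list per patient);
    y: class labels in {-1, +1}."""
    p = len(X[0])
    centroids = {}
    for c in (-1, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[i] for r in rows) / len(rows) for i in range(p)]
    # Pooled variance: squared deviations from the own-class centroid,
    # divided by n - 2 (assumed denominator, two estimated means per gene).
    pooled = [
        sum((x[i] - centroids[label][i]) ** 2 for x, label in zip(X, y))
        / (len(X) - 2)
        for i in range(p)
    ]
    return centroids, pooled


def predict_dlda(model, x):
    """Assign x to the class minimizing sum_i (x_i - mu_i^c)^2 / sigma_i^2."""
    centroids, pooled = model

    def distance(c):
        return sum((xi - mu) ** 2 / s2
                   for xi, mu, s2 in zip(x, centroids[c], pooled))

    return min((-1, 1), key=distance)


# Toy data: gene 0 separates the classes, gene 1 does not.
X = [[1.0, 5.0], [1.2, 4.0], [0.8, 6.0], [3.0, 5.5], [3.2, 4.5], [2.8, 5.0]]
y = [-1, -1, -1, 1, 1, 1]
model = fit_dlda(X, y)
print(predict_dlda(model, [1.1, 5.2]), predict_dlda(model, [2.9, 4.0]))
```

A gene with zero pooled variance would cause a division by zero in this sketch; real data would need such genes to be filtered out or the variance to be regularized.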


3.3 Univariate Gene Selection

Diagonal linear discriminant analysis can be directly applied to microarray data. Nevertheless, gene selection considerably improves its performance. In gene selection, we impose additional regularization by limiting the number of genes in the signature. A simple way to select informative genes is to rank them according to a univariate criterion measuring the difference in mean expression values between the two classes. We suggest a regularized version of the t-statistic also used for detecting differential gene expression. For gene $i$, it is defined as

$$d_i = \frac{\mu_i^{-1} - \mu_i^{+1}}{\sigma_i + \sigma_0},$$

where $\sigma_0$ denotes the statistic's regularization parameter, the so-called fudge factor. It is typically set to the median of all $\sigma_i$ and ensures that the statistic does not grow excessively when low variances are underestimated. Only top-ranking genes in the resulting list are chosen for classification.
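
A small Python sketch of this ranking is given below. It assumes that $\sigma_i$ is the pooled within-class standard deviation of gene $i$ (with an n - 2 denominator) and that genes are ranked by the absolute value of $d_i$; both choices, like the function names and the toy data, are illustrative assumptions.

```python
from statistics import mean, median


def regularized_t(X, y):
    """d_i = (mu_i^{-1} - mu_i^{+1}) / (sigma_i + sigma_0) for every gene,
    with sigma_0 set to the median of all sigma_i (the fudge factor)."""
    neg = [x for x, label in zip(X, y) if label == -1]
    pos = [x for x, label in zip(X, y) if label == 1]
    diffs, sigmas = [], []
    for i in range(len(X[0])):
        mu_neg = mean(r[i] for r in neg)
        mu_pos = mean(r[i] for r in pos)
        ss = (sum((r[i] - mu_neg) ** 2 for r in neg)
              + sum((r[i] - mu_pos) ** 2 for r in pos))
        sigmas.append((ss / (len(X) - 2)) ** 0.5)
        diffs.append(mu_neg - mu_pos)
    sigma_0 = median(sigmas)
    return [d / (s + sigma_0) for d, s in zip(diffs, sigmas)]


def top_genes(X, y, k):
    """Indices of the k genes with the largest |d_i|."""
    d = regularized_t(X, y)
    return sorted(range(len(d)), key=lambda i: abs(d[i]), reverse=True)[:k]


# Gene 0: large difference; gene 1: none; gene 2: tiny difference, tiny variance.
X = [[1.0, 5.0, 2.00], [1.2, 4.0, 2.01], [0.8, 6.0, 1.99],
     [3.0, 5.5, 2.03], [3.2, 4.5, 2.04], [2.8, 5.0, 2.02]]
y = [-1, -1, -1, 1, 1, 1]
d = regularized_t(X, y)
print([round(v, 3) for v in d])
print(top_genes(X, y, 2))
```

Without the fudge factor, gene 2 would score -3 from a mean difference of only 0.03; with it, the score shrinks to roughly -0.14, which illustrates how $\sigma_0$ damps genes whose variance is very small.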


3.4 Generation of Diagnostic Signatures

In this section, we suggest a simple procedure to generate and validate a diagnostic signature within a clinical microarray study. It is illustrated in Fig. 4.

3.4.1 Preprocess Your Data

First, normalize your microarray data in order to make the expression values comparable. Various methods for microarray data preprocessing and normalization have been suggested and are equally valid in computational diagnostics. From now on, we assume that gene expression profiles are normalized and on an additive scale (log transformed).

Fig. 4 Overview of the suggested procedure to generate a diagnostic signature. $K$ represents the regularization level (e.g., number of genes selected), $K_{opt}$ denotes the best regularization level found in cross-validation. $L$ represents the training set and $E$ the test set

3.4.2 Divide Your Data into a Training and a Test Set

In order to allow an unbiased evaluation of the generated diagnostic signature, we suggest separating the study data right from the start into two parts: the training set $L$ for learning and the test set $E$ for evaluation. The entire learning process must be executed on the training data only. The evaluation of the resulting diagnostic signature must be performed afterward using only the test set.

The training set's exclusive purpose is to learn a diagnostic signature. Unclear or controversial cases can and should be excluded from learning. You may consider focusing on extreme cases. For instance, Liu et al. improve their outcome prediction by focusing the learning phase on very short-term and very long-term survivors [22].

The test set's exclusive purpose is to evaluate the diagnostic signature. Since you want to estimate the performance of a signature in clinical practice, the test set should reflect expected population properties of the investigated disease. For example, if the disease is twice as common in women as in men, the gender ratio should be close to 2:1 in the test set too.

The size of the training and test sets has an impact on both the performance of the signature and the accuracy of its evaluation. Large training sets lead to a better performance of the signature, while large test sets lead to more accurate estimates of the performance. In particular, small test sets result in unreliable error estimation due to sampling variance (see Note 2). We recommend splitting the data in a ratio of 2:1.

3.4.3 Find the Best Regularization Level

Search for the best regularization level exclusively using the training data. Apply univariate gene filtering and DLDA to learn signatures with varying complexity and estimate their performance on independent data. Your best choice for the regularization level is the one leading to the best performance. See Note 3 for alternative learning algorithms and feature selection schemes.
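
As a rough illustration of this search, the Python sketch below estimates performance on independent data by class-balanced cross-validation and keeps the number of genes $K$ with the fewest misclassifications. Everything in it is an assumption made for the example: the function names, the toy data, the use of four bins instead of ten (the toy set has only eight patients), and the n - 2 denominator for pooled variances. Gene selection is repeated inside every iteration so that the held-out bin never influences it.

```python
from statistics import mean, median


def gene_scores(X, y):
    """|d_i| of the regularized t-statistic, fudge factor = median sigma."""
    neg = [x for x, c in zip(X, y) if c == -1]
    pos = [x for x, c in zip(X, y) if c == 1]
    diffs, sigmas = [], []
    for i in range(len(X[0])):
        m_neg, m_pos = mean(r[i] for r in neg), mean(r[i] for r in pos)
        ss = (sum((r[i] - m_neg) ** 2 for r in neg)
              + sum((r[i] - m_pos) ** 2 for r in pos))
        sigmas.append((ss / (len(X) - 2)) ** 0.5)
        diffs.append(m_neg - m_pos)
    s0 = median(sigmas)
    return [abs(d) / (s + s0) for d, s in zip(diffs, sigmas)]


def dlda_predict(X, y, genes, x_new):
    """Train DLDA on the selected genes only and classify x_new."""
    cent = {c: {i: mean(x[i] for x, lab in zip(X, y) if lab == c)
                for i in genes} for c in (-1, 1)}
    var = {i: sum((x[i] - cent[lab][i]) ** 2 for x, lab in zip(X, y))
              / (len(X) - 2) for i in genes}
    dist = {c: sum((x_new[i] - cent[c][i]) ** 2 / var[i] for i in genes)
            for c in (-1, 1)}
    return min((-1, 1), key=dist.get)


def stratified_folds(y, k):
    """Give each sample a bin index so that classes are spread evenly."""
    fold = [0] * len(y)
    for c in set(y):
        members = [j for j, lab in enumerate(y) if lab == c]
        for pos, j in enumerate(members):
            fold[j] = pos % k
    return fold


def cv_error(X, y, K, k=10):
    """Number of misclassifications summed over all calibration bins."""
    fold = stratified_folds(y, k)
    errors = 0
    for f in range(k):
        train = [j for j in range(len(y)) if fold[j] != f]
        calib = [j for j in range(len(y)) if fold[j] == f]
        Xt, yt = [X[j] for j in train], [y[j] for j in train]
        scores = gene_scores(Xt, yt)  # selection uses the learning set only
        genes = sorted(range(len(scores)), key=scores.__getitem__,
                       reverse=True)[:K]
        errors += sum(dlda_predict(Xt, yt, genes, X[j]) != y[j] for j in calib)
    return errors


# Toy training set: gene 0 is informative, genes 1 and 2 are noise.
X = [[1.0, 5.0, 3.0], [1.3, 4.0, 3.4], [0.7, 6.1, 2.5], [1.1, 4.6, 3.2],
     [3.0, 5.4, 3.1], [3.3, 4.4, 2.6], [2.7, 5.9, 3.3], [2.9, 4.9, 2.9]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
candidates = [1, 2, 3]
cv_errors = {K: cv_error(X, y, K, k=4) for K in candidates}
best_K = min(candidates, key=cv_errors.get)  # ties go to the smaller K
print(cv_errors, best_K)
```

Resolving ties in favor of the smaller $K$ is a design choice of this sketch that favors stronger regularization when the cross-validation error cannot tell two levels apart.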

To estimate performance on independent data we recommend 
tenfold cross-validation. Partition the training set into 10 bins of 
equal size. Take care to generate bins that are balanced with respect 
to the classes to be discriminated. That is, classes should have the 
same frequency in the bins as in the complete training data. Use 
each bin in turn as the calibration set, and pool the other bins to 
generate the learning set. In each iteration, select genes according 
to the regularized t-statistics d_i and the regularization level K to be
evaluated, learn a signature by applying DLDA on the restricted 
learning set, and compute the number of misclassifications in the 
calibration bin. The cross-validation error is the sum of these errors. 
Use this estimate for performance on independent data to
determine the optimal amount of regularization [23, 24]. See Fig. 5
for an algorithmic representation of the cross-validation procedure.

1. Choose regularization level K.

2. Separate the training set L into 10 bins L_1, ..., L_10

3. For i in 1 to 10 do

   a. Select features S according to K from D[L \ L_i]

   b. Learn f_i(K) on D[L \ L_i, S]

   c. Evaluate f_i(K) on D[L_i, S]

4. Average error rates determined in step 3.c

Fig. 5 Algorithmic representation of the tenfold cross-validation procedure. K is
the regularization level to be evaluated. D[L] denotes the study data restricted to
patients in the set L. D[L, S] represents the study data restricted to the patient
set L and the gene set S. The \ operator is used for set difference

A straightforward method for finding a good set of candidate
values for K is forward filtering. Starting with the most discriminating
gene, include additional genes one by one until the cross-validation
error reaches an optimum. Stop iterating and set the
optimal regularization level K_opt to the value of K that produced
the smallest cross-validation error.

3.4.4 Learn the Diagnostic Signature

For the optimal level of regularization K_opt, compute the diagnostic
signature f on the complete training set L. This is the final
signature.

3.4.5 Evaluate Your Signature

Apply your final signature f to the test set E to estimate the
misclassification rate. Note that the result is subject to sampling
variance (see Note 4 for more information on sampling variance).

3.4.6 Document Your Signature

Diagnostic signatures eventually need to be communicated to other
healthcare centers. It should be possible for your signature to be
used to diagnose patients worldwide, at least if the same measurement
technology and platform is used to profile them. You should
therefore provide a detailed description of the signature that you
propose. The list of genes contributing to the signature is not
enough. The mathematical form of both the classification and the
preprocessing models needs to be specified together with the values
of all parameters. For DLDA, communication of the centroid for
each group and the variance for each gene is necessary to specify the
signature completely. While exact documentation of the
classification signature is crucial, involvement of genes in a signature
should not be interpreted biologically (see Note 5).
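
The tenfold cross-validation of Fig. 5, followed by learning the final signature on the complete training set, might look as follows in Python. This is a minimal sketch, not the authors' implementation: it assumes two classes, uses a simplified regularized t-like score with a hypothetical fudge constant `s0`, takes the regularization level K to be the number of genes kept, and runs on deterministic toy data. All function names are illustrative.

```python
import math
import random

def stratified_bins(y, n_bins=10, seed=0):
    """Deal the samples of each class round-robin into balanced bins."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    pos = 0
    for label in sorted(set(y)):
        members = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(members)
        for i in members:
            bins[pos % n_bins].append(i)
            pos += 1
    return bins

def gene_stats(X, y, idx, s0=0.1):
    """Per gene: t-like score, class means, pooled variance (two classes)."""
    classes = sorted(set(y[i] for i in idx))
    stats = []
    for g in range(len(X[0])):
        groups = [[X[i][g] for i in idx if y[i] == c] for c in classes]
        mus = [sum(v) / len(v) for v in groups]
        ss = sum((x - m) ** 2 for v, m in zip(groups, mus) for x in v)
        var = ss / (len(idx) - len(classes))
        score = abs(mus[0] - mus[1]) / (math.sqrt(var) + s0)
        stats.append((score, mus, var))
    return classes, stats

def learn_dlda(X, y, idx, k):
    """Keep the k top-scoring genes with their centroids and variances."""
    classes, stats = gene_stats(X, y, idx)
    top = sorted(range(len(stats)), key=lambda g: -stats[g][0])[:k]
    return classes, [(g, stats[g][1], stats[g][2]) for g in top]

def predict(signature, x):
    """Assign x to the class with the nearest variance-scaled centroid."""
    classes, genes = signature
    def dist(c):
        return sum((x[g] - mus[c]) ** 2 / max(var, 1e-12)
                   for g, mus, var in genes)
    return classes[min(range(len(classes)), key=dist)]

def cv_error(X, y, k, n_bins=10, seed=0):
    """Cross-validation error with gene selection repeated in every fold."""
    errors = 0
    for held_out in stratified_bins(y, n_bins, seed):
        held = set(held_out)
        learn_idx = [i for i in range(len(y)) if i not in held]
        signature = learn_dlda(X, y, learn_idx, k)  # in-loop selection
        errors += sum(predict(signature, X[i]) != y[i] for i in held_out)
    return errors / len(y)

# Toy training set: gene 0 separates the classes, genes 1..19 do not.
y = [0] * 20 + [1] * 20
X = [[5.0 * y[i] + 0.1 * (i % 3)]
     + [float((2 * i + 3 * g) % 5) for g in range(1, 20)]
     for i in range(40)]
candidates = [1, 2, 5, 10]  # candidate regularization levels K
k_opt = min(candidates, key=lambda k: cv_error(X, y, k))
final_signature = learn_dlda(X, y, list(range(40)), k_opt)
```

The tuple returned by `learn_dlda` holds exactly what Subheading 3.4.6 asks you to document for DLDA: the selected genes, the class centroids, and the gene-wise variances.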


4 Notes 


1. Alternative High-Dimensional Data for Classification : So far, 
we have focused on classification based on microarray data. 
Various new technologies have been developed to generate 
high-dimensional multivariate data for profiling in the life 
sciences. Data generated from these new technologies can 
often be used for classification in a similar manner as just 
described for microarray data. 

Custom microarrays: General purpose genome-wide microarrays
are too complex and expensive for routine diagnosis in
hospitals. Therefore, clinicians are interested in cost- and
time-effective, but officially approved tools to reliably diagnose
patients using high-dimensional gene expression measurements.
For instance, the custom AmpliChip Leukemia microarray
(by Roche Molecular Systems) was specifically designed
for the classification of various types of leukemia [25]. The chip
contains 1480 distinct probe sets with 1457 of them used for
generating normalized signal intensities of disease-related
genes; the remaining probe sets interrogate control sequences
and housekeeping genes. The genes on the AmpliChip are
selected based on the MILE study on 2096 leukemia patients
using the large genome-wide standard Affymetrix microarray
HG-U133 Plus 2.0 [9]. The AmpliChip is intended to identify
the correct type of leukemia for new patients. Thereby, it is
limited to a preselected set of leukemias. Because the AmpliChip
is based on the Affymetrix microarray technology, a classifier
could be generated and applied for classification as
discussed earlier. At the same time, the AmpliChip comes
with the advantages of the Affymetrix microarray technology
such as good reproducibility, as well as drawbacks including
complex handling in the laboratory.

Digital multiplexed gene expression: One platform for digital
gene expression measurement by capturing and counting individual
mRNA transcripts is the NanoString nCounter gene
expression system. The developers report that the nCounter
system is limited to the measurement of about 500 human
genes, has a detection limit between 0.1 fM and 0.5 fM, and
a linear dynamic range of over 500-fold [26]. Hence, this
platform is limited to a subset of genes but can measure very
small amounts of material without amplification. Therefore,
the nCounter technology is more robust to partially degraded
RNA, which is expected in formalin-fixed paraffin-embedded
(FFPE) material.
In the vast majority of hospitals, patient material is only used
and stored as FFPE material. Researchers have confirmed
microarray-based signatures for classification of diffuse large
B-cell lymphomas with the nCounter technology in FFPE
material [27, 28]. In both projects, researchers have measured
fewer than 100 genes and therefore needed particular protocols
for normalization. Masque-Soler et al. [27] performed batch-to-batch
correction with sample-wise normalization factors
generated by individually averaging the internal controls. Samples
were then normalized independently with quantile normalization.
Scott et al. [28] normalized each array by dividing
the observed expression values by the geometric mean of five
housekeeping genes. In addition, a reference array of synthetic
oligonucleotides of fixed concentration ratios was included in
each run, to adjust for potential batch effects between runs. In
both cases, the expression values were finally log2-transformed.
Masque-Soler et al. as well as Scott et al. emphasize the simple
and efficient protocol for the gene expression measurement in
only 24-36 hours.
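
The housekeeping-based normalization can be illustrated with a short Python sketch. It only mirrors the two steps named above (division by the geometric mean of the housekeeping genes, then log2 transformation), assumes strictly positive counts, and leaves out the reference-array batch correction. The gene names and counts are made up.

```python
import math

def normalize_sample(counts, housekeeping):
    """Divide by the geometric mean of housekeeping genes, then take log2."""
    log_hk = [math.log(counts[g]) for g in housekeeping]
    geo_mean = math.exp(sum(log_hk) / len(log_hk))
    return {g: math.log2(c / geo_mean) for g, c in counts.items()}

# Hypothetical counts for one sample; HK1 and HK2 are housekeeping genes
sample = {"GENE_A": 800.0, "GENE_B": 50.0, "HK1": 100.0, "HK2": 400.0}
normalized = normalize_sample(sample, ["HK1", "HK2"])
```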

Classification on High-Throughput RNA Sequencing Data:
High-throughput RNA sequencing is on its way to replace
microarray transcriptome analyses for many applications. However,
fundamental differences between the data delivered from
microarrays and data from RNA sequencing experiments must
be respected during data analysis. For high-throughput RNA
sequencing, first a cDNA library is prepared from RNA input
material, which is then PCR amplified and sheared into fragments
of similar size. These fragments are then sequenced from
one or both sides, giving rise to millions of single- or paired-end
sequence reads. Strand-specific protocols can recover whether
the transcript originated from the plus or minus DNA strand,
while strand-unspecific protocols cannot. Sequence reads are
then aligned to the genome with a splicing-aware mapping tool
such as TopHat2 [29], or de novo assembled if sequencing
depth is sufficient. For most gene expression analyses of
well-annotated species, the first-align-then-annotate strategy that
uses genomic sequence information and respects splicing
events is preferred, and a number of mappers have been proposed
for this task. From the mapped sequence reads, so-called
count tables that summarize the number of observed reads for
all features of interest (e.g., exons or genes) can be generated.
While expression values from microarrays are on a continuous
scale, reflecting fluorescence intensities with a distribution that
is close to log-normal, count data is Poisson or negative-binomial
distributed. To be able to apply classification algorithms
to sequence data, one must either (1) adapt the data
to use models that assume continuous and log-normal
high-throughput data, or (2) adapt the models such that they
assume Poisson or negative binomial distributed data. In the
following paragraphs, we discuss these two approaches.

Deriving expression values from sequencing data: When
high-throughput sequencing technology first emerged, this was
considered the end of the microarray era. It was hoped that
RNA sequencing would deliver ‘digital’ gene expression estimates
in an unbiased and transcriptome-wide fashion that is
independent of annotation [30]. But from many experiments
conducted so far, it became evident that the sequence library
preparation as well as the fact that short sequence reads (formerly
25 nucleotides, now up to 150 nucleotides in single or
paired-end mode, only for some technologies several hundred
or even a few thousand nucleotides, albeit with lower numbers
of total reads) are sequenced, can introduce substantial biases
in observed read counts [31]. A substantial fraction of the short
sequence reads are often not uniquely mappable to a genomic
region due to sequence similarities caused by, e.g., gene
families. But even for the uniquely mapped reads it might not
be obvious from which transcript they originated, when the
locus gives rise to more than one transcript. Therefore, estimating
transcript and gene expression levels from RNA sequencing
data is not trivial. Simplistic approaches count only uniquely
mapped reads falling in exonic regions, disregarding the rest.
These counts can then be used to derive differentially expressed
genes and exons but hold no information on the transcript
level. Resolution on the transcript level requires allocation of
reads with multiple mapping sites, both within the same and
across genes, to specific isoforms. Many algorithms have been
proposed; the majority assign multimapping reads to isoforms
in an iterative process, which was first described by Xing
et al. [32]. These models allow inference about relative transcript
abundance.

Numerous gene expression normalization methods for
RNA-sequencing data have been proposed, the most popular
(but also criticized) being RPKM (Reads Per Kilobase of transcript
per Million mapped reads) [33] and FPKM (Fragments
Per Kilobase of transcript per Million mapped reads) [34], the
latter counting fragments from which the reads originated, to
avoid double counting when having paired-end sequencing
data. Modifications such as TPM (Transcripts Per Million)
[35] claim to better represent relative molar RNA concentration
in the sample, the initial aim of the RPKM value. Software
packages written for finding differentially expressed genes from
RNA sequencing data usually work directly with the count data
and often use Negative Binomial models and some form of
dispersion shrinkage, as in DESeq2 [36] and edgeR [37].
However, both DESeq2 and edgeR offer functionality to
retrieve expression values. In DESeq2, one can apply a ridge
regularization to transform the count data to log2 scale, reflecting
a normalized expression value. This rlog transformation
uses library size factors and minimizes differences between
samples for rows with small counts. In edgeR, fitted values
from the log-link negative binomial or Poisson generalized
linear model (GLM) that are used for differential expression
analysis can be extracted as moderated log counts per million
(CPM). Once gene or transcript expression levels are at hand,
the classification algorithms and procedures described earlier in
this chapter can be applied.
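
To make the length and library-size corrections concrete, the following Python sketch computes CPM, RPKM, and TPM from a toy count table. It is an illustration only: the number of mapped reads is assumed to equal the sum of the counts in the table, feature lengths are given in base pairs, and the moderated or shrunken values returned by DESeq2 and edgeR are not reproduced.

```python
def cpm(counts):
    """Counts per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}

def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-corrected rates rescaled to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    scale = sum(rate.values())
    return {g: r * 1e6 / scale for g, r in rate.items()}

# Hypothetical counts and transcript lengths for three genes
counts = {"g1": 100, "g2": 300, "g3": 600}
lengths_bp = {"g1": 1000, "g2": 3000, "g3": 2000}
cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

In this toy example g1 and g2 receive the same RPKM and TPM although g2 has three times as many reads, because g2 is three times as long. TPM values always sum to one million within a sample, which is not guaranteed for RPKM.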

Classification with count data: An alternative approach to first
deriving expression estimates from RNA sequencing data is to
directly work on the count data but adapt the classification
procedure accordingly. For the identification of differentially
expressed genes, methods have adapted models for count data,
but classification has been lagging behind so far, with currently
only one proposed method for RNA sequencing data, namely,
the Poisson linear discriminant analysis (PLDA) classifier [38].
For classification using PLDA, the authors propose to normalize
the count data with an estimate of the size factor, which
represents the sequencing library size, followed by a power
transformation to account for overdispersion relative to the
Poisson model, meaning that the variance is larger than the
mean. Instead of assuming normally distributed data as in
standard linear discriminant analysis (LDA), Poisson linear
discriminant analysis assumes independent features whose
counts follow a Poisson distribution. The classification
rule that assigns test samples to a specific class is then linear.
Since it is not desirable to use all features of a high-dimensional
dataset for classification, sparse PLDA (sPLDA) uses shrinkage
similar to Nearest Shrunken Centroids (NSC) classifiers that
are also used by PAM for classification of microarray data (see
Note 3). sPLDA has been shown to perform well on slightly
overdispersed data relative to the Poisson model, but performance
decreases on severely overdispersed data. The authors of
sPLDA propose to investigate whether negative-binomial
models might increase performance for the latter kind of datasets.
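
A Poisson discriminant rule of this kind can be sketched in Python as follows. This is a simplified illustration in the spirit of PLDA, not the published implementation: size factors are estimated from total counts, the power transformation and the sparsity-inducing shrinkage are omitted, and the smoothing constant `beta` as well as all function names and toy counts are assumptions.

```python
import math

def fit_plda(X, y, beta=1.0):
    """Estimate class-specific expression ratios d[k][j] from count data."""
    n_genes = len(X[0])
    total = sum(sum(row) for row in X)
    s = [sum(row) / total for row in X]  # size factor per sample
    g = [sum(row[j] for row in X) for j in range(n_genes)]  # gene totals
    classes = sorted(set(y))
    model = {"g": g, "classes": classes, "d": {}, "prior": {}}
    for k in classes:
        members = [i for i, lab in enumerate(y) if lab == k]
        s_k = sum(s[i] for i in members)
        # observed class count over count expected without class effect
        model["d"][k] = [
            (sum(X[i][j] for i in members) + beta) / (s_k * g[j] + beta)
            for j in range(n_genes)
        ]
        model["prior"][k] = len(members) / len(y)
    return model

def predict_plda(model, x):
    """Assign x to the class with the largest Poisson log-likelihood score."""
    s_new = sum(x) / sum(model["g"])  # size factor of the new sample
    def score(k):
        d = model["d"][k]
        return (sum(xj * math.log(dj) for xj, dj in zip(x, d))
                - s_new * sum(gj * dj for gj, dj in zip(model["g"], d))
                + math.log(model["prior"][k]))
    return max(model["classes"], key=score)

# Toy count table: gene 0 is high in class A, gene 1 is high in class B
X = [[10, 1, 5], [12, 2, 6], [1, 10, 5], [2, 12, 6]]
y = ["A", "A", "B", "B"]
model = fit_plda(X, y)
```

The score is linear in the counts of the new sample, which is why the resulting classification rule is linear.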

Metabolomics: The aim of metabolomics is primarily the comprehensive
analysis of the flow of small organic compounds
through bioenergetic and biosynthetic pathways by their quantitative
analysis in cells, tissues, organs, biological fluids, and
whole organisms. Typical compounds include amino acids,
sugars, organic acids, bases, lipids, vitamins, and various conjugates
of substances of exogenous origin. Fields of application
include such diverse topics as investigating the health status of
dairy cows [39] or analysis of chronic kidney diseases [40].
Metabolomic investigations are mainly conducted by employing
hyphenated mass spectrometry or nuclear magnetic resonance
(NMR) spectroscopy. Here, we will focus on the
application of solution NMR spectroscopy. NMR is a versatile
and powerful method for metabolite identification and quantification,
as it allows the simultaneous detection of all proton-containing
metabolites present at micromolar or higher concentrations
in a given sample. NMR signal volumes scale linearly with
concentration, enabling accurate quantification of analytes. Furthermore,
NMR requires very little sample pretreatment and,
typically, no prior chemical derivatization of molecules. On the
other hand, a disadvantage of NMR spectroscopy when compared
to mass spectrometry, for example, is its relatively poor
sensitivity.

NMR spectroscopy: The theory of NMR spectroscopy is well
established. For a comprehensive description, see for example
Ernst et al. [41]. To observe an NMR signal, nuclei such as
protons possessing a spin unequal to zero are brought into a
static magnetic field, where only certain orientations, i.e., certain
states of the magnetic moments of the nuclei, are allowed.
By applying an additional electromagnetic field of appropriate
frequency, transitions between the different states can be
induced. This leads to a time-dependent change of the
macroscopic magnetization of the nuclei, which is
recorded in the time domain as the NMR signal. Applying a
Fourier transform to the NMR signal yields the final NMR
spectrum. The positions of the signals in an NMR
spectrum depend on the electronic environment of the nuclei.
As the electronic environment depends among other factors on
the type of molecule and on the position of a nucleus within a
molecule, in the final NMR spectrum each signal corresponds
to a certain nucleus or group of nuclei of a certain molecule. By
comparison with reference spectra obtained from pure compounds,
identification of metabolites becomes feasible. Signal
intensities scale linearly with metabolite abundance and can
thus be quantified relative to a given reference compound. An
example of a typical NMR spectrum of human urine, in which
the signals of up to several hundred different compounds may
be detected, is shown in Fig. 6.

Fig. 6 1D ¹H NMR spectrum of human urine. As an example for metabolite identification the signals of
hippurate and creatinine are marked. The enlarged region demonstrates the high complexity of the spectrum

Preprocessing: Acquisition of NMR spectra is generally followed
by multivariate statistical data analysis. Here, the first step
involves correction for variation in signal position across spectra
due to differences in pH, salt concentration, and/or sample
temperature. A widely used and robust method to compensate
for these effects is spectral binning, where an NMR spectrum is
split into a number of segments or bins. Equally sized bins are
mostly used, albeit other schemes such as adaptive binning have
been proposed. Data points inside every bin are integrated so
that the whole spectrum is then represented as a vector of
bucket integrals. Alternative approaches include, for example,
signal alignment techniques [42].
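
Equal-width binning can be sketched in Python as follows. The sketch assumes a spectrum given as parallel lists of chemical shifts (in ppm) and intensities on an equally spaced grid, and it approximates integration by summing the data points in each bin. The function name and the toy spectrum are hypothetical, and the bin width is left to the caller.

```python
def bin_spectrum(ppm, intensity, bin_width):
    """Sum the intensities falling into equally sized chemical-shift bins."""
    lo = min(ppm)
    n_bins = int((max(ppm) - lo) / bin_width) + 1
    buckets = [0.0] * n_bins
    for shift, value in zip(ppm, intensity):
        index = min(int((shift - lo) / bin_width), n_bins - 1)
        buckets[index] += value
    return buckets

# Toy spectrum with six data points, binned into segments of 1 ppm
ppm = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
buckets = bin_spectrum(ppm, intensity, bin_width=1.0)
```

A signal that drifts slightly between spectra still falls into the same bin as long as the drift is small compared with the bin width, which is what makes binning robust to small shifts in signal position.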

Data normalization: Generally, metabolomic datasets are
prone to unwanted experimental and/or biological variances
and biases. To minimize these disturbing factors, data normalization
and scaling approaches may be used. The different strategies
can be grouped into methods that either adjust the
variance of metabolites by variance stabilization and variable
scaling strategies, or remove unwanted sample-to-sample variations.
A simple but often effective approach for reducing intersample
variations of urinary data is scaling relative to the signal
of creatinine. For other sample matrices scaling of every spectrum
to a total sum of one may be used. Under the condition
that only a relatively small proportion of metabolites is regulated
in approximately equal shares up and down so that no
large systematic differences in total spectral area between
biological groups exist, Variance Stabilization and Normalization
[43] as well as Quantile Normalization [44] give reliable
results in our experience [45].
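
The two simple scaling schemes can be written down directly. The following Python sketch is illustrative only: it operates on one vector of bucket integrals, and the index of the reference bucket (for example the one containing a creatinine signal) is assumed to be known.

```python
def scale_to_total(buckets):
    """Scale a spectrum so that its bucket integrals sum to one."""
    total = sum(buckets)
    return [b / total for b in buckets]

def scale_to_reference(buckets, reference_index):
    """Scale a spectrum relative to one reference bucket."""
    reference = buckets[reference_index]
    return [b / reference for b in buckets]

total_scaled = scale_to_total([1.0, 1.0, 2.0])
reference_scaled = scale_to_reference([2.0, 4.0, 8.0], reference_index=1)
```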

Sample classification: Classification of an unknown sample to
known classes of disease (e.g., healthy or diseased) is a typical
application of metabolomics. Generally, classification algorithms
are trained on a training dataset where the class label
of each sample is known, followed by application of the trained
algorithm to additional independent test data. When test
data are difficult to obtain, performance evaluation is often
performed within a cross-validation setting, where the whole
dataset is iteratively split into training and test sets. By employing
a nested cross-validation scheme, where parameters relevant
for the algorithm are optimized within inner loops, it is
ensured that results are not biased by training or parameter
optimization (see Note 4). In our experience, Random
Forests [46] and Support Vector Machines [47-49] are
particularly suited for the analysis of high-dimensional NMR
data [50].

Conclusion: Many of the aforementioned methods were originally
developed for analysis of gene expression microarray data
and they might as well be applied to other high-dimensional
data such as metabolomic data generated by means of
hyphenated mass spectrometry as well as to proteomic datasets.

2. Pitfalls in Signature Evaluation: There are several pitfalls leading
to overly optimistic estimations, for instance, using too
small or unbalanced validation sets [51]. When using a single
independent test set for evaluation of diagnostic signatures,
only training data is used for gene selection, classifier learning,
and adaptive model selection. The final signature is then evaluated
on the independent test set. Unfortunately, this estimator
can have a substantial sample variance, due to the random
selection of patients in the test set. This is especially the case if
the test set is small. Thus, good performance in small studies
can be a chance artifact. For instance, Ntzani et al. [52] have
reviewed 84 microarray studies carried out before 2003 and
observed that positive results were reported strikingly often on
very small datasets.
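
How strongly the test set size limits the precision of an error estimate can be gauged with a simple calculation. The Python sketch below assumes independent test samples, so that the number of misclassifications is binomially distributed, and uses the normal approximation for a 95 % interval. The function name and the numbers are illustrative.

```python
import math

def error_rate_interval(n_errors, n_test, z=1.96):
    """Error rate with an approximate 95 % interval (normal approximation)."""
    p = n_errors / n_test
    se = math.sqrt(p * (1 - p) / n_test)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# 6 misclassifications among 30 test samples
rate, lower, upper = error_rate_interval(6, 30)
```

With only 30 test samples, an observed error rate of 20 % is compatible with true error rates from roughly 6 % to 34 %.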

Another prominent problem is the selection bias caused by
improperly combining cross-validation with gene selection
[14, 51, 53]. Gene selection has a strong impact on the predictive
performance of a signature. It is an essential part of the
signature-building algorithm. There are two possible ways to
combine gene selection with cross-validation: either apply gene
selection to the complete dataset and then perform cross-validation
on the reduced data, or perform gene selection in
every single step of cross-validation anew. We call the first
alternative out-of-loop gene selection and the second in-loop
gene selection. In-loop gene selection gives the better estimate
of generalization performance, while the out-of-loop procedure
is overoptimistic and biased toward low error rates. In
out-of-loop gene selection, the genes selected for discriminative
power on the whole dataset bear information on the samples
used for testing. In-loop gene selection avoids this
problem. Here, genes are only selected on the training data of
each cross-validation iteration and the corresponding test sets
are independent.

The impact of overoptimistic evaluation through out-of-loop
gene selection can be very prominent. For example, Reid
et al. [54] report the reevaluation of two public studies on
treatment response in breast cancer. They observe classification
errors of 39 % and 46 %, respectively, when applying in-loop
gene selection. Using the overoptimistic out-of-loop gene
selection, error rates are underestimated at 25 % and 24 %,
respectively. Similarly, Simon et al. [51] describe a case in breast
cancer outcome prediction where the out-of-loop cross-validation
method estimates the error rate at 27 % while the
in-loop cross-validation method estimates an error rate of
41 %. Nevertheless, Ntzani et al. [52] report that 26 % of 84
reviewed microarray studies published before 2003 provide
overoptimistic error rates due to out-of-loop gene selection.
Seven of these studies have been reanalyzed by Michiels et al.
[55]. They determine classification rates using nested cross-validation
loops and average over many random cross-validation
partitionings. In five of the investigated datasets,
classification rates no better than random guessing are
observed.

3. Alternative Classification Algorithms: Several authors have 
observed that complex classification algorithms quite often do 
not outperform simple ones like diagonal linear discrimination 
on clinical microarray data [15, 16, 29, 56]. We report two 
more algorithms, however, which have shown good performance
on microarray data.

The first one is a variant of DLDA called Prediction Analysis
of Microarrays (PAM, [57, 58]). An outstanding feature of
PAM is gene shrinkage. Filtering uses a hard threshold: when
selecting k genes, gene k+1 is thrown away even if it bears as
much information as gene k. Gene shrinkage is a smoother,
continuous, soft-thresholding method. PAM is an application
of nearest centroid classification, in which the class centroids
are shrunken in the direction of the overall centroid. For each
gene i the value d_ic measures the distance of the centroid for
class c to the overall centroid in units of its standard deviation.
Each d_ic is then reduced by an amount Δ in absolute value and is
set to zero if its value becomes negative. With increasing Δ, all
genes lose discriminative power and more and more of them
will fade away. Genes with high variance vanish faster than
genes with low variance. A link between PAM and classical
linear models is discussed in Huang et al. [59].
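
The shrinkage step can be written as a soft-thresholding function. The Python sketch below is illustrative: `d` stands for the standardized distance of one class centroid from the overall centroid for one gene, `delta` for the shrinkage amount, and the helper `surviving_genes` with its toy input is hypothetical.

```python
def soft_threshold(d, delta):
    """Reduce |d| by delta and set the result to zero if it turns negative."""
    shrunk = abs(d) - delta
    if shrunk <= 0:
        return 0.0
    return shrunk if d > 0 else -shrunk

def surviving_genes(distances, delta):
    """Genes with at least one nonzero shrunken class distance."""
    return [gene for gene, per_class in distances.items()
            if any(soft_threshold(d, delta) != 0.0 for d in per_class)]

# Standardized centroid distances per gene for two classes (made up)
distances = {"gene1": [2.5, -2.5], "gene2": [0.4, -0.4]}
kept = surviving_genes(distances, delta=1.0)
```

In contrast to hard filtering, a gene's contribution here decreases continuously with increasing `delta` before it vanishes completely.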

Support vector machines (SVMs, [2, 47-49]) avoid the ill-posed
problem shown in Fig. 2 in Subheading 1.2 by fitting a
maximal (soft) margin hyperplane between the two classes. In
high-dimensional problems there are always several perfectly
separating hyperplanes, but there is only one separating hyperplane
with maximal distance to the nearest training points of
either class. Soft margin SVMs trade off the number of misclassifications
with the distance between hyperplane and nearest
data points in the training set. This trade-off is controlled by
a tuning parameter C. The maximal margin hyperplane can be
constructed by means of inner products between training
examples. This observation is the key to the second building
block of SVMs: the inner product x_i^T x_j between two training
examples x_i and x_j is replaced by a nonlinear kernel function.
The use of kernel functions implicitly maps the data into a
high-dimensional space, where the maximal margin hyperplane is
constructed. Thus, in the original input space, boundaries
between the classes can be complex when choosing nonlinear
kernel functions. In microarray data, however, choosing a linear
kernel and thus deducing linear decision boundaries usually
performs well.

A gene selection algorithm tailored to SVMs is recursive
feature elimination [60]. This procedure eliminates the feature
(gene) contributing least to the normal vector of the hyperplane
before retraining the support vector machine on the data
excluding the gene. Elimination and retraining are iterated
until maximal training performance is reached. In
high-dimensional microarray data, eliminating features one by one
is computationally expensive. Therefore, usually several features
are eliminated in each iteration.

4. Alternative Evaluation Schemes: The signature evaluation suggested
in Subheading 3.4 yields an estimate of the signature’s
misclassification rate but does not provide any information on 
its variability. In order to investigate this aspect, you can apply 
methods designed to evaluate signature generating algorithms. 
In order to evaluate the performance of a learning algorithm, 
the sampling variability of the training set has to be taken into 
account. One approach is to perform the partitioning into 
training and test set (L and E) many times randomly and
execute the complete procedure illustrated in Fig. 4 for each
partitioning. This yields a distribution of misclassification rates 
reflecting sample variability of training and test sets. More 
effective use of the data can be made via cross-validation. 
Together with the procedure described in Subheading 3.4, 
two nested cross-validation loops are needed [24, 61]. Braga- 
Neto and Dougherty [62] advise averaging cross-validation 
errors over many different partitionings. Ruschhaupt et al. 
[61] and Wessels et al. [15] implement such complete validation
procedures and compare various machine learning
methods. 

In the leave-one-out version of cross-validation, each sample
is used in turn for evaluation while all other samples are attributed
to the learning set. This evaluation method estimates the
expected error rate with almost no bias. For big sample sizes, it
is computationally more expensive than tenfold cross-validation
and suffers from high variance [5, 7, 63]. Efron
et al. [64] apply bootstrap smoothing to the leave-one-out
cross-validation estimate. The basic idea is to generate different
bootstrap replicates, apply leave-one-out cross-validation to
each and then average results. Each bootstrap replicate contains
n random draws from the original dataset (with replacement so
that samples may occur several times). A result of this approach
is the so-called 0.632+ estimator. It takes the possibility of
overfitting into account and reduces variance compared to the
regular cross-validation estimates. Ambroise et al. [53] have
found it to work well with gene expression data.
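
The following Python sketch shows how a bootstrap replicate can be drawn and how the 0.632+ estimate is commonly assembled from three ingredients. It is an illustration under stated assumptions rather than a reproduction of the cited procedure: the formula follows the usual textbook statement of the 0.632+ rule, and the training error, the leave-one-out bootstrap error, and the no-information error rate `gamma` are taken as given inputs.

```python
import random

def bootstrap_replicate(n, rng):
    """Draw n sample indices with replacement from 0..n-1."""
    return [rng.randrange(n) for _ in range(n)]

def err632plus(err_train, err_loo_boot, gamma):
    """Combine training error and leave-one-out bootstrap error.

    gamma is the error rate expected if predictors and class labels
    were unrelated (no-information error rate).
    """
    err1 = min(err_loo_boot, gamma)
    if err1 > err_train and gamma > err_train:
        r = (err1 - err_train) / (gamma - err_train)  # relative overfitting
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err1

rng = random.Random(0)
replicate = bootstrap_replicate(20, rng)
estimate = err632plus(err_train=0.05, err_loo_boot=0.25, gamma=0.5)
```

Without overfitting the weight on the bootstrap error is 0.632. It approaches one as the relative overfitting rate approaches one, so that a perfect but meaningless fit to the training data is not rewarded.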

Cross-validation and bootstrap smoothing error rates, as 
well as error rates determined by repeated separation of data 
into training and test set, refer to the classification algorithm, 
not to a single signature. In each iteration, a different classifier 
is learnt based on different training data. The cross-validation 
performance is the average of the performance of different 
signatures. Nevertheless, cross-validation performance can be 
used as an estimate of a signature’s expected error rate. 

5. Biological Interpretation of Diagnostic Signatures: The methodology
described earlier was presented in the classification context only.
In addition, you might be tempted to interpret the genes
driving the models biologically, but this is dangerous. First, it
is unclear how exactly regularization biases the selection of
signature genes. While this bias is a blessing from the diagnostic
perspective, this is not the case from a biological point of view.
Second, signatures are generally not unique: While outcome
prediction for breast cancer patients has been successful in
various studies [65-67], the respective signatures do not overlap
at all. Moreover, Ein-Dor et al. [68] derived a large number
of almost equally performing nonoverlapping signatures from a
single dataset. Also, Michiels et al. [55] report highly unstable
signatures when resampling the training set.

This is not too surprising considering the following: the 
molecular cause of a clinical phenotype might involve only a 
small set of genes. This primary event, however, has secondary 
influences on other genes, which in turn deregulate more genes 
and so on. In clinical microarray analysis we typically observe an 
avalanche of secondary or later effects, often involving 
thousands of differentially expressed genes. While complicating 
biological interpretation of signatures, such an effect does not 
compromise the clinical usefulness of predictors. On the contrary,
it is conceivable that only signals enhanced through
propagation lead to a well generalizing signature.

Chapter 13


Molecular Similarity Concepts for Informatics Applications 

Jurgen Bajorath 

Abstract 

The assessment of small molecule similarity is a central task in chemoinformatics and medicinal chemistry. 
A variety of molecular representations and metrics are applied to computationally evaluate and quantify 
molecular similarity. A critically important aspect of molecular similarity analysis in chemoinformatics and
pharmaceutical research is that one is typically not interested in quantifying the degree of structural or 
chemical similarity between compounds per se, but rather in extrapolating from molecular similarity to 
property similarity. In other words, one assumes that there is a correlation between calculated similarity and 
specific properties of small molecules including, first and foremost, biological activities. Although similarity 
is a priori a subjective concept, and difficult to quantify, it must computationally be assessed in a formally 
consistent manner. Otherwise, there is little utility of similarity calculations. Consistent treatment requires 
approximations to be made and the consideration of alternative computational similarity concepts, as 
discussed herein. 

Key words Molecular similarity and dissimilarity, Similarity-property principle, Similarity functions, 
Molecular descriptors, Fingerprints, Structure-activity relationships 


1 Introduction 


The assessment of molecular similarity is a key task in chemoinfor¬ 
matics [1-3] and also of central relevance for medicinal chemistry 
[3, 4]. Importantly, molecular similarity is qualitatively or quanti¬ 
tatively evaluated as an indicator of activity similarity. However, 
similarity is principally a subjective concept and no commonly 
applicable similarity criteria or rules exist [3]. In evaluating similar¬ 
ity relationships, many cognitive aspects play a role that often 
unconsciously determine human judgment. Regardless of whether 
similarity is assessed by humans or computationally, pattern recog¬ 
nition plays a central role in this process. To illustrate and further 
specify this key point, let us consider a quote from a recent work on 
molecular similarity analysis [3]: “More than anything else, the 
recognition of molecular patterns, based on human or computational 
exploration, provides a basis for arriving at decisions as to whether two 
compounds are similar to each other or not. Since data complexity 
generally scales with the number of patterns that can be discovered, it 
quickly becomes impossible for humans to consider them in a 
comprehensive manner. Therefore, humans intuitively, and often 
unconsciously, reduce patterns to simpler ones that contain the essential 
feature(s) of the original pattern. But unlike applications of 
computational pattern recognition, the precise nature of these key 
patterns in human pattern recognition is unknown.” Importantly, the 
computational assessment of molecular similarity stringently requires a 
consistent application of molecular representations and patterns 
derived from these representations as well as a consistent 
quantification of pattern correspondence in compounds under comparison. 
Furthermore, to quote one more time, we should consider the 
following [3]: “The key patterns used by humans or computers will 
generally vary from individual to individual or from algorithm to 
algorithm, a situation that most likely will yield results with varying 
degrees of agreement for the same set of data. This follows because the 
representations used by humans and by computers, which most likely 
are significantly different, are crucial components in determining 
what can be understood about relationships of objects to each other, 
whether they are physical objects, concepts, ideas—or compounds .” 
Moreover, human perception of molecules is strongly context- 
and order-dependent [5]; i.e., depending on the order in which 
we view compounds and on their structural context, different 
conclusions about molecular patterns and associated properties are usually 
drawn. However, computational similarity methods must account 
for molecular representations and patterns derived from such repre¬ 
sentations in a constant and context-independent manner. Hence, 
for computational analysis, the inherent subjectivity of the similarity 
concept must be formalized in a predefined and transparent man¬ 
ner, which requires approximations to be made. The introductory 
section concludes with the definition (and differentiation) of a few 
key terms to provide a basis for the following discussion of alterna¬ 
tive computational similarity concepts. 

1.1 Substructure Matching vs. Similarity Analysis 

It is important to distinguish between substructure searching/matching 
and similarity calculations. Substructure search methods [6] are used to 
detect the presence or absence of a substructure (fragment) in a 
compound. By definition, substructure matching provides a binary 
(yes/no) result. In a substructure search, all database compounds that 
contain a prespecified substructure (query) or a combination of 
substructures are retrieved. By contrast, similarity analysis must 
differentiate between different degrees of molecular similarity and 
hence capture a continuum of similarity relationships. Importantly, the 
question of whether two compounds are similar to each other, and to what 
degree, cannot be answered from first principles. 

1.2 Molecular vs. Chemical Similarity 

These terms are often used synonymously, which is not entirely 
correct. Chemical similarity primarily considers reaction information, 
the presence or absence of specific functional groups, and 
physicochemical properties. By contrast, the assessment of molecular 
similarity is mostly based upon structural and topological features of 
compounds. In chemoinformatics, one typically attempts to extrapolate 
from molecular similarity to biological/activity similarity, and not 
from chemical similarity (although the boundaries are often fluid). 
Hence, herein the focus is on molecular similarity. 

1.3 Molecular Similarity, Dissimilarity, and Diversity 

It is also important to distinguish between similarity, dissimilarity, 
and diversity. Dissimilarity is the inverse of similarity [7]. Molecular 
similarity and dissimilarity are best rationalized at the level of 
compound pairs (i.e., on the basis of pairwise compound comparisons). 
By contrast, diversity is a property of a compound set, which 
is closely related to chemical space coverage [8]. The major goal of 
diversity analysis is the generation of a compound set of limited size 
that best (evenly) covers a given chemical reference space [8]. In 
this context, chemical space is best understood as a computational 
construct, i.e., an n-dimensional reference space obtained from 
n preselected descriptors [9] (see Note 1). Such descriptors are 
generally defined as mathematical models of molecular structure 
and/or properties, and their complexity varies greatly [9]. 
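
As a rough illustration of what "evenly covering" a descriptor-based reference space can mean in practice, the following Python sketch selects a small subset with a greedy MaxMin heuristic. MaxMin is only one possible choice and is an assumption made here, not a method prescribed in this chapter; the descriptor values, the use of Euclidean distance, and the function name `maxmin_select` are likewise hypothetical.

```python
import math

def maxmin_select(points, k, start=0):
    """Greedy MaxMin selection (illustrative): repeatedly add the point whose
    distance to its nearest already-selected point is largest.
    Assumes the points are distinct."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance of every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Hypothetical compounds described by n = 2 preselected descriptors.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.0, 0.0)]
subset = maxmin_select(descriptors, k=3)
print(subset)
```

In this toy case the two near-duplicates of the first compound are skipped in favor of the distant points, which is the behavior one would expect from a diversity-oriented selection.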


2 Similarity Concepts 


In the following, alternative concepts for molecular similarity anal¬ 
ysis are presented. Key aspects of such similarity concepts are illu¬ 
strated in Fig. 1. First, fundamental methodological requirements 
are specified. 


2.1 Key Components of Computational Similarity Analysis 


Regardless of the specifics of molecular similarity analysis, the cal¬ 
culation of similarity values (see Note 2) requires two basic compo¬ 
nents including (1) a molecular representation to capture 
(similarity-relevant) molecular features and (2) a similarity function 
(often called similarity coefficient) to quantitatively compare 
chosen representations. In addition, a weighting scheme can be 
introduced (and might be considered as a third basic component) 
to differentially weigh (scale) individual features of a molecular 
representation for similarity calculations (if all features are equally 
considered, no weighting is required). Given this methodological 
framework, similarity calculations mostly (but not exclusively) rely 
on pairwise molecular comparisons, i.e., a pairwise assessment of 
molecular similarity relationships. For many similarity functions/ 
coefficients, calculated similarity values fall within the interval 
[0, 1]. A similarity value of ‘0’ reflects the presence of completely 
distinct representations and a value of ‘1’ the presence of identical 
representations (see Note 3). This also gives rise to the fundamental 
numerical relationship between similarity and dissimilarity: 

$$\text{Dissimilarity} = 1 - \text{Similarity}$$

Fig. 1 Different concepts for molecular similarity analysis are schematically illustrated. 2D similarity is 
assessed on the basis of molecular graphs and 3D similarity on the basis of compound conformations. 
Furthermore, global similarity methods compare representations of entire molecules (such as a structure- 
and/or property-based bit string representation), whereas local similarity methods compare subgraphs or 
predefined geometrical features of compounds 

Although many similarity methods rely on pairwise compound 
comparisons, this is not strictly required. For example, partitioning 
algorithms operate by assigning compounds to subsets of similar 
ones or, alternatively, to subsections of chemical reference space 
(also termed cells) on the basis of descriptor values (coordinates) [10]. 
In such cases, no pairwise compound comparisons are carried out. 
Rather, proximity in chemical reference space, e.g., mapping to the 
same cell, is applied as a similarity criterion. 
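
The cell-based idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two descriptors, their value ranges, the number of bins per axis, and the function name `assign_cells` are all hypothetical and chosen only to show how "mapping to the same cell" replaces pairwise comparison.

```python
from collections import defaultdict

def assign_cells(descriptors, bins, lower, upper):
    """Map each compound to a cell of an n-dimensional grid spanning
    [lower[d], upper[d]] with bins[d] equal-width intervals per axis."""
    cells = defaultdict(list)
    for name, values in descriptors.items():
        cell = []
        for v, b, lo, hi in zip(values, bins, lower, upper):
            idx = int((v - lo) / (hi - lo) * b)
            # Clamp so that out-of-range values and v == hi land in an edge bin.
            cell.append(min(max(idx, 0), b - 1))
        cells[tuple(cell)].append(name)
    return dict(cells)

# Hypothetical compounds with two descriptors (e.g., molecular weight, logP).
compounds = {
    "cpd1": (310.0, 2.1),
    "cpd2": (325.0, 2.4),
    "cpd3": (480.0, 4.9),
}
cells = assign_cells(compounds, bins=(4, 4),
                     lower=(100.0, -2.0), upper=(500.0, 6.0))
print(cells)
```

Compounds that end up in the same cell (here `cpd1` and `cpd2`) are treated as similar without any pairwise similarity value being computed.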

2.2 Two- vs. Three-Dimensional Similarity 

Either 2D or 3D molecular representations can be utilized as a basis 
for similarity calculations. In general, 2D representations are 
derived from information provided by molecular graphs. Popular 
2D representations for similarity analysis include ‘fingerprints’ that 
capture, for example, molecular fragments and structural patterns 
[11], topological pathways through compounds [12], or topological 
atom environments [13]. Fingerprints encode this information 
either as bit strings [11, 12] or feature sets [13]. Such 2D 
fingerprints are often of very different design. For example, the molecular 
access system structural key fingerprint (MACCS) [11] consists of 
166 structural fragments with 1-10 nonhydrogen atoms. If a 
compound contains a specific feature, the corresponding bit position is 
set to ‘1’; otherwise, it is set to ‘0.’ This represents a common 
procedure for many (but not all) binary fingerprint representations. 
Different from MACCS, the extended connectivity fingerprint with 
bond diameter four (ECFP4) [13] captures local bond topologies 
as atom environments that specify the connectivity of atoms in the 
neighborhood of each nonhydrogen atom in a compound. The size 
of the neighborhood depends on the bond diameter. Following the 
ECFP design, many different atom environment features are 
generated in a molecule-specific manner. Thus, this fingerprint does 
not have a fixed format but consists of a feature set. 

Furthermore, approaches for similarity analysis that are based on 3D 
representations include methods to compare molecular conformations 
[14], shape matching algorithms [15], or 3D fingerprint 
methods [16]. Such 3D fingerprints encode conformation-dependent 
molecular properties or geometric compound features. 
Because compounds are active in specific 3D conformations 
(so-called bioactive conformations), 3D similarity methods should in 
principle have higher information content than 2D approaches. 
However, this does not mean that 3D methods necessarily perform 
better than methods based upon 2D representations. Since molecular 
similarity is typically assessed as an indicator of activity similarity, 
as further discussed later, 3D similarity methods can only 
produce meaningful results if they use information from bioactive 
conformations. However, given the uncertainties associated with 
identifying bioactive conformations of ligands in the absence of 
experimental 3D structures, 2D approaches are often less error 
prone and more robust than 3D methods and produce superior 
results in activity predictions made on the basis of similarity analysis. 

2.3 Global vs. Local Similarity 

Molecular similarity assessment can either focus on entire compounds, 
applying a global view of similarity, or parts of compounds, 
applying a local view. A prime example for local similarity 
assessment is the ‘pharmacophore’ concept [17]. A pharmacophore is 
generally defined as the (spatial) arrangement of those atoms or 
groups in a compound that are responsible for its biological activity. 
Thus, pharmacophore approaches apply a strictly local view of 
similarity by attempting to directly focus on activity determinants. 
As such, pharmacophore modeling is often hypothesis driven. In a 
pharmacophore search, database compounds are selected as candidates 
that match a given pharmacophore model, but structurally 
depart from known reference molecules (from which the 
pharmacophore was derived) in regions outside the pharmacophore [17]. 
Accordingly, pharmacophore searching is essentially a matching 
procedure producing a binary (yes/no) readout, analogously to 
substructure searching. By contrast, other computational 
approaches to similarity evaluation apply a global, whole-molecule 
view. For example, if compounds are translated into structural 
fingerprints, as discussed earlier, global molecular representations 
are obtained, the comparison of which results in global similarity 
assessment. Such global views of similarity based upon more or less 
abstract molecular representations are a hallmark of many 
chemoinformatics methods. These considerations directly lead us to a 
fundamental similarity principle. 

2.4 Similarity-Property Principle 

From a book publication [18], which has been a milestone event for 
the chemoinformatics field, the similarity-property principle (SPP) 
emerged. The SPP simply states that similar compounds should have 
similar properties, with biological activity representing the most 
important property. This principle clearly reflects a central issue of 
molecular similarity research (as already discussed previously), i.e., 
the extrapolation from computed molecular similarity to activity 
similarity, without taking activity data directly into account. Despite 
its apparent simplicity, the SPP has profound and complex meth¬ 
odological consequences. First and foremost, it requires the appli¬ 
cation of a global molecular view and a consistent definition and 
computational assessment of similarity. In contrast to pharmaco¬ 
phore modeling, the SPP does not make any assumptions about 
substructures or functional groups in compounds that are activity 
relevant. Rather, it implies that gradual changes in molecular struc¬ 
ture are accompanied by gradual changes in activity. By contrast, 
small structural modifications of compounds that greatly affect or 
abolish biological activity, which are often observed in chemical 
optimization (and best accounted for using local similarity con¬ 
cepts), fall outside the applicability domain of the SPP, as illustrated 
in Fig. 2. Despite the fact that the SPP cannot account for all 
structure-activity relationships, it represents a paradigm for similar¬ 
ity searching where one generally attempts to identify structurally 
increasingly diverse compounds having biological activities similar 
to known reference molecules [19]. Similarity searching represents 
one of the most popular applications of molecular similarity analy¬ 
sis. In the following, fingerprint similarity searching is discussed as 
an example, which includes all key components of similarity analysis 
and also illustrates major opportunities and limitations of similarity 
calculations. 

[Fig. 2 structures not reproduced; panel labels: IC50 = 1.2 pM (left), IC50 > 2000 pM (right)] 

Fig. 2 A protein kinase inhibitor (left) and a structural analog (right) are shown. These compounds are very 
similar and only distinguished by the substitution of an amino with a nitro group. However, the inhibitor on the 
left is active, whereas the analog on the right is essentially inactive. On the basis of the SPP, these compounds 
would not be distinguished and would be assumed to have similar activity if the compound on the left was 
known to be active. As activity measurements, IC50 values are reported that give the compound concentration 
at half-maximal inhibition. The figure was adapted from [2] 


3 Fingerprint Searching 

Fingerprint similarity searching is conceptually based on the SPP. 
Fingerprints including the exemplary MACCS and ECFP4 designs 
introduced previously represent global molecular representations. 
They are compared using similarity functions to obtain a numerical 
value that quantifies fingerprint (bit string or feature set) overlap, 
which is used as a measure of molecular similarity. In a typical 
fingerprint search, one or more known active compounds are 
selected as reference molecules and their fingerprint representations 
are searched against fingerprints of database compounds [19]. The 
outcome of a similarity search is a ranking of database compounds 
according to decreasing similarity to the reference(s). The general 
goal is the identification of compounds from the ranking that differ 
structurally from the known reference(s) but have similar activity, as 
further discussed later. 
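
The basic search workflow described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: fingerprints are represented as Python sets of on-bit positions, the bit positions and compound identifiers are invented, and the binary Tanimoto coefficient (defined in Subheading 3.1) is assumed as the similarity function.

```python
def tanimoto(fp_a, fp_b):
    """Binary Tanimoto coefficient for fingerprints stored as sets of
    on-bit positions. Returning 0.0 for two empty sets is a convention
    chosen for this sketch."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0

def rank_database(reference_fp, database):
    """Return (compound_id, similarity) pairs sorted by decreasing similarity."""
    scored = [(cid, tanimoto(reference_fp, fp)) for cid, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical fingerprints: each set lists the bit positions set to 1.
reference = {1, 4, 7, 9}
database = {
    "db1": {1, 4, 7, 9, 12},  # close analog of the reference
    "db2": {1, 4, 20, 33},    # more distant compound
    "db3": {50, 51},          # no shared features
}
ranking = rank_database(reference, database)
for cid, sim in ranking:
    print(cid, round(sim, 3))
```

The output of the sketch is the ranking itself; which rank positions are worth inspecting is a separate question that the similarity values alone do not answer.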

3.1 Similarity Functions 

A variety of similarity functions/coefficients have been introduced 
for molecular similarity calculations (and were often adapted from 
other research fields) [20, 21]. In chemoinformatics, the Tanimoto 
coefficient (Tc) [21, 22] represents the most popular similarity 
function. For two vectors of real values, A and B, the general 
Tc (Tc_g) is defined as: 

$$\mathrm{Tc}_g(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i}$$

For binary vectors such as the MACCS fingerprint, this formulation 
is reduced to: 

$$\mathrm{Tc}(A,B) = \frac{c}{a + b - c}$$

Here, a and b are the number of features present in the fingerprints 
of compounds A and B, respectively (represented by bit positions 
set to 1), and c is the number of features that are common to A and B. 
In the general form, A_i and B_i are the ith elements of the vectors A 
and B, and n is the number of vector elements. Thus, the binary Tc 
quantifies fingerprint overlap by producing similarity values between 
0 and 1, which is the case for many similarity coefficients, as 
mentioned earlier. 
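
To make the relationship between the two formulas concrete, the following Python sketch implements both and checks that they agree on a pair of binary vectors. The example bit vectors and function names are hypothetical, and returning 0.0 when both vectors are empty is a convention assumed here.

```python
def tanimoto_general(a, b):
    """General Tc for two real-valued vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    ab = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    denom = aa + bb - ab
    return ab / denom if denom else 0.0

def tanimoto_binary(a, b):
    """Binary Tc computed from feature counts: c / (a + b - c)."""
    n_a, n_b = sum(a), sum(b)            # features present in A and in B
    c = sum(x & y for x, y in zip(a, b))  # features common to A and B
    denom = n_a + n_b - c
    return c / denom if denom else 0.0

# Hypothetical 8-bit fingerprints: a = 4, b = 4, c = 3.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
print(tanimoto_binary(fp_a, fp_b), tanimoto_general(fp_a, fp_b))
```

For 0/1 vectors the squared terms equal the bit counts, so the general form collapses to c / (a + b - c), which is why both functions return the same value here.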

The Tc is symmetric because the similarity of A with respect to 
B is the same as the similarity of B with respect to A, which 
represents a characteristic feature of many (but not all) similarity 
coefficients that are used. 

From the Tc, a dissimilarity measure is derived by calculating 
the complement known as the Soergel distance (Sg) [20]: 


$$\mathrm{Sg}(A,B) = 1 - \mathrm{Tc}(A,B) = 1 - \frac{c}{a + b - c}$$

Another similarity function that can be used to introduce asymmetry 
in similarity calculations is the Tversky coefficient (Tv) [21, 23] 
defined as: 

$$\mathrm{Tv}_{\alpha,\beta}(A,B) = \frac{c}{\alpha(a - c) + \beta(b - c) + c}$$

Here, a, b, and c correspond to the Tc formalism. The two 
additional parameters α and β (typically representing values 
between 0 and 1) are introduced to weigh the number of features 
that are unique to A or B, i.e., (a - c) and (b - c), respectively. If A 
and B are fingerprints of a reference and database compound, 
respectively, the larger α becomes relative to β, the more weight is 
put on the unique bit settings of A and the less weight on the bit 
settings of B (and vice versa). This introduces asymmetry in the 
similarity calculations and makes it possible to emphasize unique 
features of reference or database compounds. For the special case 
α = β = 1, features of A and B are equally weighted and Tv is 
identical to Tc. Furthermore, in the case α = β = 0.5, Tv is 
transformed into the Dice coefficient (Dc) [20]: 

$$\mathrm{Dc}(A,B) = \frac{c}{\frac{1}{2}(a + b)}$$

Here, the denominator represents the arithmetic mean of the 
number of features in A and B. 
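
The special cases and the asymmetry of the Tversky coefficient can be checked numerically. The sketch below is illustrative only: fingerprints are again represented as sets of on-bit positions, the example sets and the chosen α and β values are hypothetical, and a zero denominator is mapped to 0.0 by convention.

```python
def tversky(fp_a, fp_b, alpha, beta):
    """Tversky coefficient for fingerprints given as sets of on-bit positions."""
    c = len(fp_a & fp_b)
    only_a = len(fp_a) - c  # features unique to A, i.e., a - c
    only_b = len(fp_b) - c  # features unique to B, i.e., b - c
    denom = alpha * only_a + beta * only_b + c
    return c / denom if denom else 0.0

ref = {1, 2, 3, 4}        # a = 4
db = {3, 4, 5, 6, 7, 8}   # b = 6, c = 2

tc = tversky(ref, db, 1.0, 1.0)   # Tanimoto: c / (a + b - c)
dc = tversky(ref, db, 0.5, 0.5)   # Dice: c / ((a + b) / 2)
soergel = 1.0 - tc                # Soergel distance as complement of Tc

# With alpha != beta the result depends on which compound is the reference.
asym_ab = tversky(ref, db, 0.9, 0.1)
asym_ba = tversky(db, ref, 0.9, 0.1)
print(tc, dc, soergel, asym_ab, asym_ba)
```

With α = 0.9 and β = 0.1, unique features of the first argument are penalized much more strongly than those of the second, so swapping the roles of reference and database compound changes the value.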

These similarity functions illustrate the variety of coefficients 
(there are many more) that have been adapted for quantifying the 
similarity of molecular fingerprints (and other representations). 
Although the Tc is currently the most popular coefficient, it is not 
a priori superior to others that yield different similarity values. As 
further discussed later, the relative ranking of compounds is critical 
for the outcome of similarity searching, but not the absolute 
magnitude of similarity values. Regardless of chosen coefficients, it is of 
critical importance for similarity analysis that a chosen similarity 
function quantifies similarity relationships in a consistent manner. 

However, a general statistically grounded complication of 
similarity searching that principally affects all coefficients is that 
increasing size or topological complexity of compounds typically increases 
the feature (bit) density of binary fingerprints, which causes a 
tendency to produce higher similarity values for larger compounds 
[19]. Such molecular size or complexity effects in fingerprint 
searching can be overcome by equally considering bits set to ‘1’ 
and ‘0’ (the latter reflect feature absence and are usually not taken 
into account) [24] or by merging fingerprints with their bit 
complements [25], thus producing representations of constant bit 
density for all compounds (see Note 4). 

3.2 Search Strategies 

If only a single reference compound is available, its fingerprint is 
compared to the fingerprints of all database compounds in a 
pairwise manner and the database compounds are ranked accordingly. 
If multiple references are available, which usually increases the 
information content of similarity searching, different strategies 
can be applied to take this information into account. For example, 
following the centroid approach [26], an average fingerprint is 
calculated for all reference molecules and compared to individual 
fingerprints of database compounds (see Note 5). Alternatively, 
following a nearest neighbor approach [26], similarity values of a 
given database compound are separately calculated for all reference 
molecules. Then, the largest value is assigned to the database 
compound or the top k values are averaged to yield a final similarity 
score. Such nearest neighbor approaches are currently most widely 
used to combine contributions from multiple reference molecules. 

3.3 Significance of Similarity Values 

If calculated similarity values are to be used as indicators of 
biological activity, key questions include (1) whether similarity 
values are statistically significant and (2) whether similarity 
threshold values exist that firmly indicate the presence of activity 
relationships between reference and test compounds. Answering such 
questions is a nontrivial task. First of all, similarity values typically 
change for different combinations of fingerprints and similarity 
coefficients [3, 27], as illustrated in Fig. 3. Hence, they must be 
considered individually for given representations and similarity 
functions. From global similarity value distributions, statistical 
significance of similarity values can be calculated using conventional 
p-values [3]. For example, a Tc threshold at a significance level of 
p = 0.01 reflects a probability of 1% that the Tc value calculated 
for two randomly chosen compounds meets or exceeds this 
threshold. Importantly, however, significance levels are only based 
upon the distributions of similarity values and do not take 
biological activity as an associated property into account. Hence, 
one ultimately needs to determine whether similarities of 
compounds sharing the same activity occur by chance or if calculated 
similarity values at a certain level are indicative of similar activity. 

Fig. 3 Reported are Tc (dark gray) and Dc (light gray) similarity value distributions 
for 10 million comparisons of randomly chosen small molecules using the (a) 
MACCS and (b) ECFP4 fingerprint 

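
The link between a background distribution of similarity values and a significance threshold can be sketched as an empirical calculation. The code below is a minimal illustration: the background sample is a synthetic, evenly spaced placeholder (not real Tc data for random compound pairs), and the function names are hypothetical.

```python
from bisect import bisect_left

def empirical_p_value(sorted_background, observed):
    """Fraction of background values that meet or exceed `observed`.
    `sorted_background` must be sorted in ascending order."""
    n = len(sorted_background)
    return (n - bisect_left(sorted_background, observed)) / n

def threshold_for_p(background, p_level):
    """Smallest background value whose empirical p-value is <= p_level,
    or None if no value in the sample is that rare."""
    ordered = sorted(background)
    for value in ordered:
        if empirical_p_value(ordered, value) <= p_level:
            return value
    return None

# Synthetic placeholder for similarity values of random compound pairs.
background_tc = [i / 1000 for i in range(1000)]

threshold = threshold_for_p(background_tc, 0.01)
p_obs = empirical_p_value(sorted(background_tc), 0.5)
print(threshold, p_obs)
```

A threshold obtained in this way depends entirely on the fingerprint, the similarity function, and the compound sample used to build the background, and it says nothing about activity by itself.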
Therefore, different compound activity classes were used to 
calculate MACCS Tc similarity values for all pairs of compounds sharing 
the same activity [3]. Depending on the activity class, median 
MACCS Tc values varied from ~0.3 to ~0.75. Thus, the similarity 
values also showed strong compound class dependence. 
Furthermore, a MACCS Tc value of ~0.65 was found to correspond to a 
significance level of p = 0.01. Thus, many compounds sharing the 
same activity yielded similarity values that not only varied but were 
not statistically significant. Corresponding observations were made 
for combinations of other fingerprints and similarity functions. It 
follows that generally applicable fingerprint similarity threshold 
values as potential indicators of specific biological activities cannot 
be determined with certainty, which substantially complicates 
similarity searching. Therefore, similarity-based compound 
rankings must be carefully analyzed. 

3.4 Compound 

The output of fingerprint similarity searching is a ranking of data¬ 
base compounds according to decreasing similarity to reference 
molecules. If we search a database for new active compounds, it is 
a priori clear that most database compounds cannot be specifically 
active. Hence, rankings are expected to mostly consist of inactive 
compounds. A ranking begins with those compounds that are most 
similar to the reference(s), typically including structural analogs. 
Such analogs have the highest probability to be active but are not 
very interesting candidates for selection (they can be readily identi¬ 
fied through a substructure search using a given core structure as a 
query). Rather, one would like to identify compounds that struc¬ 
turally depart from known references but retain similar activity, in 
accord with the SPP. Figure 4 shows examples of structurally 
diverse compounds having similar activity that were successfully 
identified on the basis of similarity search calculations. In order to 
identify such candidate compounds, one must proceed further 
down the database ranking where more distant structural relation¬ 
ships occur, without considering the magnitude of similarity values. 
It has also been shown that fingerprint Tc threshold values that 
would indicate a significant enrichment of specifically active 
compounds in a database ranking cannot be determined [27], for the 
reasons discussed earlier. However, systematic fingerprint search 
calculations have revealed that a few structurally novel active 
compounds are usually found at relatively high rank positions [27, 28], 
which reflects an early enrichment characteristic for small subsets of 
available active compounds in similarity search calculations. For 
example, one or more novel active compounds are frequently 
detected among the ~100 top-ranked database compounds [28]. 
These candidate compounds cannot be identified on the basis of 
calculated similarity values but by inspecting the continuum of 
similarity relationships captured by compound rankings. One 
does not know where exactly these active compounds are ranked 
but often has a good chance of identifying at least some of them by 
considering ~100 highly ranked candidates. Hence, despite the 
difficulties associated with comparing absolute similarity values for 
given fingerprints, similarity functions, and compound classes, the 
results of similarity search calculations become meaningful when 
focusing on individual compound rankings. Early enrichment 
characteristics of fingerprint search calculations provide an opportunity 
to identify small numbers of novel active compounds. However, 
global enrichment is also detectable, although one cannot 
determine generally applicable similarity threshold values indicating 
such enrichment. As a rule of thumb, it has been shown that there 
typically is a statistically significant enrichment of available active 
compounds in database rankings when at least ~1% of all database 
compounds are selected [27]. For a search database of today’s size 
containing a million or more compounds, this fraction still 
corresponds to 10,000 or more candidates, far too many for 
inspection and individual selection. Nonetheless, such global enrichment 
characteristics offer substantial opportunities for focusing 
compound libraries on individual targets. Even if 10% of database 
compounds were selected on the basis of similarity search 
calculations to ensure likely coverage of available active 
compounds, significant progress would be made to limit the magnitude 
of experimental efforts for compound screening. 

Fig. 4 Shown are examples of compounds with limited (remote) similarity that share the same specific 
biological activity and were identified using similarity searching. Panels: adrenoreceptor antagonists, 
estrogen antagonists, and endothelin A antagonists. The figure was adapted from [2] 
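
Selection from a ranking, rather than by a similarity cutoff, can be sketched as follows. The example combines the nearest neighbor strategy from Subheading 3.2 with the selection of a fixed fraction of top-ranked compounds. All fingerprints, identifiers, and the fraction used are hypothetical, and the tiny database only serves to make the ranking easy to follow.

```python
import math

def tanimoto(fp_a, fp_b):
    """Binary Tanimoto coefficient for sets of on-bit positions."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0

def knn_score(db_fp, reference_fps, k=1):
    """Nearest neighbor fusion: mean of the k largest similarities to the
    reference compounds (k = 1 gives the maximum)."""
    sims = sorted((tanimoto(db_fp, ref) for ref in reference_fps), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def select_top_fraction(database, reference_fps, fraction=0.01, k=1):
    """Rank compounds by fused similarity and keep the top fraction
    (at least one compound), regardless of absolute similarity values."""
    ranked = sorted(database,
                    key=lambda cid: knn_score(database[cid], reference_fps, k),
                    reverse=True)
    n_select = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_select]

references = [{1, 2, 3, 4}, {10, 11, 12}]
database = {
    "a": {1, 2, 3, 5},
    "b": {10, 11, 13},
    "c": {20, 21},
    "d": {1, 10},
    "e": {1, 2, 3, 4, 10, 11, 12},
}
top_1nn = select_top_fraction(database, references, fraction=0.4, k=1)
top_2nn = select_top_fraction(database, references, fraction=0.4, k=2)
print(top_1nn, top_2nn)
```

The two fusion settings select the same compounds here but in a different order, which illustrates why rankings should be interpreted relative to a given search setup rather than through absolute similarity values.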


4 Conclusions 


Molecular similarity analysis is a core task in chemoinformatics. 
Herein, alternative similarity concepts have been introduced and 
discussed that are relevant for the comparison of small molecules 
and evaluation of their similarity relationships. A key aspect of 
similarity analysis is that one is typically not interested in evaluating 
molecular similarity per se but in considering computed similarity 
relationships as an indicator of activity similarity, as best exemplified 
by the similarity-property principle. Molecular similarity analysis is 
also becoming increasingly relevant for bioinformatics applications 
such as the analysis of gene expression profiles of pharmaceutically 
relevant compounds or the systematic study of ligand-target 
interactions and prediction of ligand-based target relationships. In such 
cases, boundaries between chemo- and bioinformatics become 
rather fluid. In order to review key aspects of (global) molecular 
similarity analysis in context, fingerprint similarity searching has 
been discussed, which highlights general opportunities and limita¬ 
tions of similarity calculations. Fingerprints are bit string represen¬ 
tations of molecular structure (and associated properties) that are 
relatively simplistic in their design. Using similarity coefficients, 
fingerprint overlap is quantified as a measure of molecular similarity 
from which one extrapolates to activity similarity (without taking 
activity data or parameters explicitly into account). It has been 
emphasized that absolute similarity values have little, if any, 
meaning for biological activity and that similarity threshold values that 
might be relevant for specific activities cannot be determined. In 
fact, the strong molecular representation and compound class 
dependence of similarity calculations continues to represent a 
major conundrum in chemoinformatics that is just beginning to 
be addressed in a comprehensive manner. Nonetheless, similarity 
search calculations have value and practical utility in the identifica¬ 
tion of novel active compounds. Computed similarity values are 
informative on a relative scale and represent a continuum of simi¬ 
larity relationships that can be further explored. Similarity-based 
database rankings produced by fingerprint searching often display 
an early enrichment of small numbers of active compounds. In 
practical applications where a limited number of candidate com¬ 
pounds are selected for testing, this is often sufficient to identify 
novel active molecules. Regardless of any methodological consid¬ 
erations and computational concepts, it should be understood that 
similarity is in its essence a subjective concept and that any attempts 
to quantify similarity relationships between molecules, or any other 
objects, will have intrinsic shortcomings. Being aware of such
limitations helps to avoid pitfalls associated with the
(mis)interpretation of calculated similarity values and to focus on
the opportunities of the approach. After all, a consistent computational
assessment of molecular similarity relationships is an absolute must,
despite principal limitations, given that our ability to subjectively
judge similarity relationships is limited to rather small numbers
of compounds and is clearly insufficient considering current data
volumes.


5 Notes 


1. For informatics applications, chemical space is usually approxi¬ 
mated using molecular descriptor-based reference spaces, which 
typically vary depending on the specific requirements of a
computational application. An a priori representation of chemical
space does not exist.

2. Most but not all similarity methods calculate numerical similar¬ 
ity values to quantify a similarity relationship between two com¬ 
pounds. This is a major attraction of similarity analysis because 
complex molecular relationships are ultimately reduced to a 
simple numerical score. However, a caveat is that calculated 
similarity values are often over- or mis-interpreted, as discussed 
in the text. 

3. It should be noted that a similarity coefficient value of “1”
resulting from the comparison of identical molecular represen¬ 
tations does not necessarily mean that the compared compounds 
are also identical. This is the case because molecular representa¬ 
tions often abstract from compounds to varying degrees 
(i.e., they are essentially compound models). 

4. Merging a fingerprint with its complement effectively doubles 
its length (which might lead to an increase in background noise 
of search calculations in the absence of complexity effects) but 
produces a constant bit density of 50 % for all test compounds, 
irrespective of their size and complexity. It follows that this 
modification renders the fingerprint representation independent 
of molecular size and complexity effects (due to constant bit 
density). 
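The effect described in Note 4 can be checked with a minimal sketch. It assumes fingerprints are plain Python lists of 0/1 values; the function names and the two toy fingerprints are hypothetical and only serve as an illustration.

```python
def merge_with_complement(bits):
    """Append the bitwise complement to a binary fingerprint (list of 0/1)."""
    return bits + [1 - b for b in bits]


def bit_density(bits):
    """Fraction of bits that are set to 1."""
    return sum(bits) / len(bits)


# Two toy fingerprints standing in for a small and a large compound.
small = [1, 0, 0, 0, 0, 0, 0, 0]   # density 0.125
large = [1, 1, 1, 0, 1, 1, 0, 1]   # density 0.75

for fp in (small, large):
    merged = merge_with_complement(fp)
    # The merged fingerprint is twice as long and always half-populated.
    print(len(fp), bit_density(fp), len(merged), bit_density(merged))
```

Every bit that is off in the original is on in the complement, so exactly half of the merged bits are set regardless of the original density.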

5. Even for binary fingerprints, the reference centroid represents a 
real-valued vector, which requires the application of the general 
Tc (Tcq) to compare the centroid vector with binary fingerprints 
of database compounds. 
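As an illustration of Note 5, the following sketch computes a centroid from binary reference fingerprints and compares it with a binary database fingerprint. It assumes the commonly used real-valued form of the Tanimoto coefficient, sum(ab) / (sum(a^2) + sum(b^2) - sum(ab)); the function names and toy vectors are hypothetical.

```python
def centroid(fingerprints):
    """Position-wise mean of equally long fingerprints (real-valued vector)."""
    n = len(fingerprints)
    return [sum(column) / n for column in zip(*fingerprints)]


def general_tanimoto(a, b):
    """Tanimoto coefficient for real-valued vectors of equal length."""
    ab = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    denominator = aa + bb - ab
    # Convention chosen here (an assumption): two all-zero vectors score 0.
    return ab / denominator if denominator else 0.0


references = [[1, 1, 0, 0], [1, 0, 1, 0]]   # binary reference fingerprints
database_fp = [1, 1, 0, 0]                  # binary database fingerprint

ref_centroid = centroid(references)          # [1.0, 0.5, 0.5, 0.0]
print(ref_centroid, general_tanimoto(ref_centroid, database_fp))
```

For two binary vectors the expression reduces to the familiar ratio of shared bits to bits set in either fingerprint.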
Chapter 14


Compound Data Mining for Drug Discovery 

Jürgen Bajorath


Abstract 

In recent years, there has been unprecedented growth in compound activity data in the public domain. 
These compound data provide an indispensable resource for drug discovery in academic environments as 
well as in the pharmaceutical industry. To handle large volumes of heterogeneous and complex compound 
data and extract discovery-relevant knowledge from these data, advanced computational mining approaches 
are required. Herein, major public compound data repositories are introduced, data confidence criteria 
reviewed, and selected data mining approaches discussed. 

Key words Compound activity data, Public databases, Confidence criteria, Structure-activity rela¬ 
tionships, Matched molecular pairs, Activity cliffs, Activity profiles 


1 Introduction 


The number of compounds and associated activity data available in 
public domain databases currently increases at unprecedented rates. 
Compound activity data provide an important knowledge base for 
drug discovery if the data can be effectively mined [1]. Specifically,
structure-activity relationships (SARs) can be systematically 
extracted for compounds active against current targets and utilized 
in compound design and optimization. Historically, most com¬ 
pound activity data have originated from the pharmaceutical indus¬ 
try and, for the most part, have been kept proprietary. However, 
with the changing drug discovery landscape, mergers and acquisi¬ 
tions, increasing discovery activities in academia, and more empha¬ 
sis on discovery collaborations between biotechnology companies, 
the pharmaceutical industry, and academic environments, the situ¬ 
ation has changed over the past decade. Consequences of structural 
changes in traditional drug discovery settings, the advent of aca¬ 
demic drug discovery initiatives, and much more frequent and 
dynamic interactions between academia and pharma include, 
among others, data publication and release at significantly increasing
rates. Chemical data are still not comparable in magnitude to
biological data that continue to challenge bioinformatics. However,
in addition to massively growing compound data volumes,
there also is rapidly increasing heterogeneity and complexity of 
chemical data. Taken together, these developments continuously 
increase the “big data” character of compound activity data, similar 
to biological data [2]. The “big data” nature in terms of volumes 
and complexity substantially challenges data organization, curation,
and mining activities, not only in the pharmaceutical industry
but also in the public domain. While public compound and data 
repositories have become essential foundations of drug discovery 
research in academia, it is also being recognized in pharmaceutical 
settings that one can no longer afford to ignore publicly available 
data as a source of knowledge to complement and further advance 
in-house research and development activities. Hence, there clearly 
is increasing focus on public domain compound data. 


2 Compounds, Structures, and Activity Data 

2.1 Public Domain Repositories

Current major publicly accessible databases for compounds and
activity data include ChEMBL [3, 4], BindingDB [5], PubChem
[6, 7], Open PHACTS [8], and DrugBank [9]. In addition, there
are a number of more specialized smaller public databases (and also 
large commercial databases) that are not discussed herein. 
ChEMBL has become the major repository for compound activity 
data from medicinal chemistry sources and has recently also added 
patent information [4], which is of high relevance for drug discovery.
The ChEMBL database originated in a small company environment
and later became part of the European Bioinformatics
Institute Outstation of the European Molecular Biology Laboratory,
where it is further developed [3, 10]. BindingDB was
founded in academia where it continues to be advanced and main¬ 
tained. It was initially designed to collect data for compounds active 
against targets for which three-dimensional structural information 
was available and has then increasingly incorporated compound 
activity data from the medicinal chemistry literature and other 
sources [4, 10]. In addition, as a repository for the Molecular 
Libraries Initiative of the US National Institutes of Health, Pub¬ 
Chem has become the major public domain resource for biological 
screening data [6] and also maintains large compound and sub¬ 
stance collections [7]. Open PHACTS resulted from a joint venture 
of a variety of academic institutions, small companies, and large 
pharmaceutical companies to provide pharmacological information 
for drug discovery in both the public and private sector via semantic 
web technologies [8]. The basic data unit is a so-called pharmaco¬ 
logical record. An Open PHACTS pharmacology record reports a 
biological target and associated information and/or the activity of a 
given compound. Moreover, DrugBank, which also originated
from an academic setting, is one of the major resources for
approved and experimental drugs as well as drug target information
[9]. There is ongoing exchange of data between major public
repositories including ChEMBL, BindingDB, and PubChem. It is
fair to say that ChEMBL and PubChem currently represent the
major public sources of compounds and activity data from medicinal
chemistry and biological screening, respectively.

2.2 Data Volumes

At the beginning of 2014, the ChEMBL database (release 17)
alone contained more than 1.3 million compounds with unique
structures associated with more than 12 million activity annotations
for ~9300 biological targets. In addition, PubChem’s Compound
[6], Substance [6], and BioAssay [7] collections contained ~49
million compounds, ~128 million substances, and activity data
from ~740,000 assays, respectively. Furthermore, there were
3846 confirmatory bioassays available in PubChem involving
2533 biological targets. The availability of such compound data
volumes could not have been imagined just a few years ago. Data
volumes in these public repositories further increase on a daily basis.
For example, in the current version of ChEMBL (release 18) compound
activity data for ~100 additional targets have become available
(compared to release 17).

2.3 Data Complexity

In addition to growing volumes, heterogeneity and complexity of
compound activity data continuously increase [2]. Different types
of assays, activity measurements, and target annotations at varying
confidence levels are reported. In addition, structural information
is often represented and organized in different ways. Information
provided for active compounds or drugs in different databases is
usually overlapping but distinct. This is best illustrated using an
example. Figure 1 shows the kinase inhibitors lapatinib and imatinib
that are approved drugs used in cancer treatment. For these
drugs, one can readily compare the activity and target information
available in different databases [2]. Early in 2014, DrugBank
recorded eight targets for lapatinib, ChEMBL reported three targets
for which high-confidence activity data were available, and
BindingDB contained 814 records with defined activity measurements
for this drug. Imatinib, on the other hand, was annotated with 24
targets in DrugBank and 30 high-confidence targets in ChEMBL.
Furthermore, in PubChem, lapatinib was assayed 1556 times and
active in 311 assays and imatinib was tested in 2467 assays and
active in 469 of these. Hence, even for established and
well-characterized drugs, many different measurements and target
annotations are available, which are often difficult to reconcile.

2.4 Confidence Criteria

In light of the above, it is of critical importance to carefully consider
data curation and confidence criteria. This can also be illustrated 
using an example [11]. Figure 2 shows two similar drugs, 
Fig. 1 Shown are the two marketed protein kinase inhibitors lapatinib and imatinib



Fig. 2 Shown are two structurally distantly related drugs, pyrimethamine and milrinone, which are used for 
different therapeutic indications 

pyrimethamine and milrinone, which are used for different therapeutic
indications, i.e., pyrimethamine is an antimalarial compound
and milrinone an inotropic cardiotonic agent. In
DrugBank, pyrimethamine and milrinone were annotated with 
two targets and one target, respectively. By contrast, the protein 
target summary function of ChEMBL provided 22 and 42 targets 
for pyrimethamine and milrinone, respectively, hence indicating a 
large potential inconsistency. However, one needs to take into 
consideration that the protein target summary lists all potential 
targets for an active compound or drug, regardless of the assays 
used, the type of activity measurements, and the confidence level of 
target annotations. Thus, when stringent data selection criteria
were applied to search ChEMBL (see Note 1), only one target
remained for each of pyrimethamine and milrinone, which closely
matched the target annotations reported in DrugBank. Thus, care
must be taken to critically evaluate activity data and be aware of
database-specific organization schemes, data acceptance criteria,
reported measurements, and confidence levels to avoid drawing
premature conclusions.


3 Data Mining

Given the volumes, heterogeneity, and complexity of compound
activity data, there is a need for clearly defined data selection criteria
and the development of advanced data mining concepts [1]. In the
following, examples of advanced data mining strategies are discussed
that focus the exploration of compound activity data in
different ways on systematic SAR analysis and knowledge extraction
for compound design and optimization. By contrast, virtual
screening aims to identify new active compounds rather than
explore activity and SAR information.

3.1 Virtual Compound Screening

For more than two decades, virtual screening has been one of the
most popular approaches in chemoinformatics and computational
medicinal chemistry [12]. Ligand-based virtual screening aims at
the identification of novel active compounds on the basis of known
active reference molecules [12]. Here, the main goal is the identification
of structurally diverse compounds having activity similar to
the references, often referred to as “scaffold hopping” [13]. For
this purpose, similarity-based computational screening methods are
applied [14]. Compound potency is typically not taken into
account as a search parameter in virtual screening. Rather, one
attempts to computationally extrapolate from active reference
molecules by applying principles of molecular similarity and dissimilarity
[14], regardless of the potency levels of the references.
Because virtual screening aims at the identification of novel active
compounds, it is often not carried out in biologically annotated
databases such as ChEMBL, but rather in compound databases
such as ZINC [15], which currently contains ~35 million
small molecules that are typically not biologically annotated. If
virtual screening campaigns are carried out in biologically annotated
databases, they mostly try to identify additional targets for
known active compounds. Other exemplary data mining
approaches discussed in the following focus much more on the
large-scale assessment of SAR information, rather than the identification
of novel hits.

3.2 Matched Molecular Pairs

The concept of matched molecular pairs (MMP) [16] has become
increasingly important in medicinal chemistry and compound data 
Fig. 3 Two pairs of structurally analogous inhibitors of cyclin-dependent kinase 2 (CDK2) are shown that form
matched molecular pairs (MMPs) and activity cliffs (MMP-cliffs). Substructures that distinguish the compounds
in MMPs are encircled and logarithmic potency (pIC50) values of compounds are reported. IC50 values
give the compound concentration at half-maximal inhibition

mining. An MMP is defined as a pair of compounds that only differ 
by a structural change at a single site, i.e., the exchange of a 
substructure [16]. Exemplary MMPs are shown in Fig. 3. MMPs 
can be efficiently generated algorithmically from large compound
sets [17]. This renders the MMP formalism applicable to large-scale 
compound data mining. A major attraction of this approach is that 
changes in activity or other molecular properties associated with 
MMP formation can be readily identified and compared for system¬ 
atically generated MMPs. Hence, well-defined structural changes 
encoded by MMPs can be directly related to changes in biological 
activity or other drug-discovery relevant properties [16]. This pro¬ 
vides a basis for the prediction of property effects in compound 
design and optimization. For example, structural modifications can 
be identified that consistently increase the potency of compounds 
active against a specific target. In addition, compound series can be 
detected that represent SAR transfer events [18]. SAR transfer 
involves series of pairwise corresponding structural analogs (i.e., 
pairs of compounds with corresponding chemical modifications) 
that contain distinct core structures but display similar potency 
progression. Hence, the identification of SAR transfer series 
makes it possible to replace one compound series, which might be 
toxic or exhibit other liabilities, with another series having similar 
(desired) SAR behavior, which is not affected by such liabilities.
Therefore, the exploitation of SAR transfer events is highly attractive
for compound optimization efforts.
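The MMP definition given above lends itself to a simple indexing scheme, sketched below under strong simplifying assumptions. Compounds are taken to be already fragmented at single sites into (core, substituent) pairs; the fragmentation step itself, which would normally require a chemistry toolkit, is not shown. All compound names and fragment labels are hypothetical placeholders rather than real structures, and the sketch is not the published algorithm of ref. [17].

```python
from collections import defaultdict
from itertools import combinations


def find_mmps(fragmentations):
    """Return MMPs as (compound_a, compound_b, core, substituent_a, substituent_b).

    `fragmentations` maps a compound name to a list of (core, substituent)
    pairs, one pair per single-site cut of that compound.
    """
    # Index all compounds by the core that remains after a cut.
    index = defaultdict(list)
    for compound, cuts in fragmentations.items():
        for core, substituent in cuts:
            index[core].append((compound, substituent))

    mmps = set()
    for core, members in index.items():
        # Two different compounds sharing a core but carrying different
        # substituents differ only at a single site, i.e., they form an MMP.
        for (c1, s1), (c2, s2) in combinations(members, 2):
            if c1 != c2 and s1 != s2:
                first, second = sorted([(c1, s1), (c2, s2)])
                mmps.add((first[0], second[0], core, first[1], second[1]))
    return sorted(mmps)


fragmentations = {
    "cpd_A": [("core1", "H")],
    "cpd_B": [("core1", "Cl")],
    "cpd_C": [("core2", "OMe")],
}
print(find_mmps(fragmentations))
```

Because each compound is visited once and pairs are only formed within a shared core, such an index avoids comparing every compound with every other compound.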

3.3 Activity Cliffs

The activity cliff concept is also highly relevant for SAR analysis and
compound optimization and amenable to large-scale data mining.
An activity cliff is generally defined as a pair of structurally similar or 
analogous compounds that are active against the same target but 
display a large difference in potency [19]. Accordingly, structural 
modifications of active compounds can be deduced that cause 
significant biological effects, which rationalizes the relevance of 
activity cliffs for SAR analysis and compound design. For data 
mining, similarity and potency difference criteria for activity cliff 
formation must be clearly defined and consistently applied [19]. 
Compound similarity can be computationally assessed in a variety of 
ways [14] including the formation of MMPs, as discussed above (see 
Note 2). Thus, MMP-cliffs have been introduced as a structurally 
conservative and generally applicable representation of activity cliffs 
[20]. An MMP-cliff is formed by an MMP encoding a small struc¬ 
tural modification (similarity criterion) if the participating com¬ 
pounds are active against the same target and display a potency 
difference of at least two orders of magnitude (potency difference 
criterion) [20]. Hence, the pairs of active compounds in Fig. 3 also 
represent MMP-cliffs. Applying alternative criteria for activity cliff 
formation (including MMP-cliffs), all activity cliffs formed by pub¬ 
lic domain active compounds have recently been extracted from 
ChEMBL and organized by targets [21]. It was found that 
~10-20 % of active compounds formed activity cliffs in most
target-based compound data sets (depending on the data set and 
the applied activity cliff definition). Thus, a significant proportion 
of currently available bioactive compounds form activity cliffs, 
which provide a large knowledge base for SAR exploration and 
compound optimization. From activity cliffs, SAR determinants 
can often be deduced. Utilizing the MMP and activity cliff con¬ 
cepts, compound data mining has already provided a large body of 
information for practical medicinal chemistry and drug discovery 
applications. 
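The potency difference criterion for MMP-cliffs can be written down in a few lines. The sketch below assumes that the two compounds already form an MMP, that potencies are given as IC50 values in nM for the same target and measurement type, and that pIC50 is the negative decadic logarithm of the molar IC50. The function names and example values are hypothetical and are not taken from Fig. 3.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 value in nM to pIC50 (-log10 of the molar value)."""
    return 9.0 - math.log10(ic50_nm)


def is_mmp_cliff(potency_a, potency_b, same_target=True, min_log_diff=2.0):
    """Potency criterion: at least two orders of magnitude on the same target."""
    return same_target and abs(potency_a - potency_b) >= min_log_diff


strong = pic50_from_ic50_nm(10.0)     # 10 nM
weak = pic50_from_ic50_nm(1000.0)     # 1 uM
print(strong, weak, is_mmp_cliff(strong, weak))
```

Working on the logarithmic scale turns the "two orders of magnitude" requirement into a simple difference of at least 2.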

3.4 Activity Profiles

Computational methods can also be applied to systematically
extract all high-confidence target annotations of compounds from
activity data and generate compound activity profiles. This makes it 
possible to systematically assess the promiscuity of bioactive com¬ 
pounds [22] (see Note 3). Similarity relationships between com¬ 
pounds do not need to be considered to assess their promiscuity. 
However, compounds can also be structurally organized, for exam¬ 
ple on the basis of their scaffolds (see Note 4). Through data 
mining, activity profiles of compounds and corresponding scaffolds 
have been systematically determined and organized according to 
their degree of promiscuity [23]. Activity profiles of promiscuous
scaffolds can then be used to aid in the design of compounds with
multi-target activities, which addresses another increasingly attrac¬ 
tive objective in drug discovery research. 
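A compound activity profile can be thought of as the set of targets for which high-confidence annotations exist, and the degree of promiscuity as the size of that set. The following sketch assumes that annotations are available as simple (compound, target) tuples that have already passed the confidence filter; the identifiers are hypothetical.

```python
from collections import defaultdict


def activity_profiles(annotations):
    """Collect the set of annotated targets for each compound."""
    profiles = defaultdict(set)
    for compound, target in annotations:
        profiles[compound].add(target)
    return dict(profiles)


def promiscuity_degree(profile):
    """Number of distinct targets in an activity profile."""
    return len(profile)


annotations = [
    ("cpd_1", "T1"), ("cpd_1", "T2"), ("cpd_1", "T2"),  # duplicate annotation
    ("cpd_2", "T1"),
]
profiles = activity_profiles(annotations)
ranking = sorted(profiles, key=lambda c: promiscuity_degree(profiles[c]), reverse=True)
print(ranking)
```

No structural comparison is involved here, which reflects the statement that similarity relationships are not needed to assess promiscuity.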

3.5 Conclusions

Compound activity data currently grow at unprecedented rates in
the public domain and provide a valuable resource for drug discovery
in academia and the pharmaceutical industry. Compound data
are not the only source of discovery-relevant information. 
Biological, pharmacological, and clinical data are equally, if not
more, important, depending on the stage of drug discovery and
development efforts. However, compound activity data are most 
relevant for medicinal chemistry and early-phase compound devel¬ 
opment. Compound data growth is accompanied by substantially 
increasing data heterogeneity and complexity, which challenges 
data mining and knowledge extraction. Herein, major public com¬ 
pound data repositories have been introduced and data volume, 
complexity, and confidence issues discussed. Since there is a clear 
need for advanced computational approaches and data mining 
strategies, selected concepts have also been reviewed that focus on 
large-scale SAR exploration with utility for compound design and 
optimization. It is anticipated that additional computational meth¬ 
ods will be developed that closely link large-scale data mining 
efforts and predictive modeling to systematically generate experimentally
testable SAR hypotheses and facilitate automated compound
design.


4 Notes 


1. The following specifies a protocol for the selection of high- 
confidence activity data from ChEMBL that we routinely 
apply: 

“Only compounds with direct interactions (i.e., ChEMBL target
relationship type “D”) at the highest confidence level (i.e.,
ChEMBL target confidence score 9) are extracted. Two different
types of potency measurements are separately considered including
(assay-independent) equilibrium constants (Ki values) and
(assay-dependent) IC50 values. Furthermore, approximate measurements
such as “>”, “<”, or “~” are disregarded. For compounds
with multiple Ki or IC50 values for the same target, the
geometric mean of all potency values is calculated to yield the final
potency annotation, provided all potency measurements fall
within the same order of magnitude. If this is not the case, the
measurements are discarded.”

The application of these selection criteria typically eliminates 
experimental inconsistencies, focuses on high-confidence data, 
and leads to reliable target annotations (which are separately
considered for Ki or IC50 measurements that cannot be directly
compared). 
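The selection protocol quoted above can be sketched as a filter over activity records. The record layout, field names, and example values below are hypothetical and do not reproduce the actual ChEMBL schema or API. "Within the same order of magnitude" is interpreted here as a log10 range of at most 1, which is an assumption.

```python
import math
from collections import defaultdict


def record(compound, target, rel_type, confidence, mtype, relation, value_nm):
    """Build one hypothetical activity record."""
    return {"compound": compound, "target": target, "relationship_type": rel_type,
            "confidence_score": confidence, "type": mtype,
            "relation": relation, "value_nm": value_nm}


def select_high_confidence(records, measurement_type):
    """Return {(compound, target): geometric-mean potency in nM}."""
    grouped = defaultdict(list)
    for r in records:
        if (r["relationship_type"] == "D"        # direct interaction
                and r["confidence_score"] == 9   # highest confidence level
                and r["type"] == measurement_type  # Ki and IC50 kept apart
                and r["relation"] == "="):       # no ">", "<", "~" values
            grouped[(r["compound"], r["target"])].append(r["value_nm"])

    result = {}
    for key, values in grouped.items():
        logs = [math.log10(v) for v in values]
        # Keep only measurement sets spanning at most one log unit.
        if max(logs) - min(logs) <= 1.0:
            result[key] = 10 ** (sum(logs) / len(logs))  # geometric mean
    return result


records = [
    record("cpd_1", "T1", "D", 9, "Ki", "=", 10.0),
    record("cpd_1", "T1", "D", 9, "Ki", "=", 40.0),
    record("cpd_1", "T1", "D", 9, "Ki", ">", 10000.0),  # approximate value
    record("cpd_2", "T1", "D", 9, "Ki", "=", 10.0),
    record("cpd_2", "T1", "D", 9, "Ki", "=", 1000.0),   # inconsistent values
    record("cpd_3", "T2", "D", 8, "Ki", "=", 5.0),      # lower confidence
]
selected = select_high_confidence(records, "Ki")
print(selected)
```

Calling the function once with "Ki" and once with "IC50" keeps the two measurement types separate, as the protocol requires.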

2. Alternative similarity criteria for activity cliff assessment include
the calculation of similarity values on the basis of molecular 
descriptors such as fingerprints that exceed a predefined thresh¬ 
old value. Such similarity values can also be consistently calcu¬ 
lated and are often used for activity cliff analysis. A potential 
drawback of their use is that similarity relationships calculated 
on the basis of molecular descriptors are often more difficult to 
interpret chemically than substructure relationships estab¬ 
lished, for example, on the basis of MMPs. Principles of molec¬ 
ular similarity analysis and similarity calculations are discussed 
in more detail in the accompanying chapter by the same author. 
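For comparison with the MMP-based criterion, a fingerprint-based activity cliff check might look as follows. This is a minimal sketch that assumes fingerprints are given as sets of "on" bit positions and potencies as logarithmic values; the similarity threshold is passed in by the caller, and the value used in the example is arbitrary.

```python
def tanimoto(fp_a, fp_b):
    """Binary Tanimoto coefficient for fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_cliff_by_fingerprint(fp_a, fp_b, potency_a, potency_b,
                            sim_threshold, min_log_diff=2.0):
    """Similarity above a predefined threshold plus a large potency difference."""
    similar = tanimoto(fp_a, fp_b) >= sim_threshold
    return similar and abs(potency_a - potency_b) >= min_log_diff


fp_1 = {1, 2, 3, 4}
fp_2 = {1, 2, 3, 5}
print(tanimoto(fp_1, fp_2))   # 3 shared bits out of 5 set in either: 0.6
print(is_cliff_by_fingerprint(fp_1, fp_2, 8.5, 6.0, sim_threshold=0.5))
```

The result depends on both the chosen fingerprint and the threshold, which is one reason such cliffs can be harder to interpret chemically than MMP-cliffs.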

3. Promiscuity refers here to the presence of specific interactions 
between a bioactive compound and multiple targets (as 
opposed to nonspecific binding events). The so-defined pro¬ 
miscuity provides the molecular basis of polypharmacology, 
which is an emerging theme in drug discovery. It is being 
recognized that many active compounds elicit therapeutically 
relevant effects through interactions with multiple targets and 
the ensuing pharmacological consequences (for example, by 
activating or interfering with multiple signaling pathways). 

4. A scaffold essentially represents the core structure of a com¬ 
pound. Scaffolds can be generated in different ways, for exam¬ 
ple, by removing all substituents from a molecule and retaining 
the substructure containing all rings. An activity profile of a 
given scaffold is obtained by calculating the union of the activ¬ 
ity profiles of all compounds the scaffold represents. This pro¬ 
file can be further refined by weighting individual activities by 
their frequency of occurrence in the activity profiles of different 
compounds represented by the scaffold. 
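The scaffold activity profile described in Note 4 can be sketched as a set union with optional frequency weights. The example assumes that the activity profiles of all compounds represented by one scaffold are already available as sets of target identifiers; scaffold extraction itself is not shown and the identifiers are hypothetical.

```python
from collections import Counter


def scaffold_profile(compound_profiles):
    """Return (union of targets, per-target frequency weights) for one scaffold.

    `compound_profiles` is a list of target sets, one per compound that
    the scaffold represents.
    """
    counts = Counter()
    for targets in compound_profiles:
        counts.update(set(targets))   # count each target once per compound
    n = len(compound_profiles)
    union = set(counts)
    # Weight = fraction of the scaffold's compounds active against the target.
    weights = {target: counts[target] / n for target in counts}
    return union, weights


profiles_for_scaffold = [{"T1", "T2"}, {"T1"}, {"T1", "T3"}]
union, weights = scaffold_profile(profiles_for_scaffold)
print(sorted(union), weights)
```

A target shared by all compounds receives a weight of 1, whereas an activity contributed by a single compound is down-weighted accordingly.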
Chapter 15


Studying Antibody Repertoires with Next-Generation 
Sequencing 

William D. Lees and Adrian J. Shepherd 

Abstract 

Next-generation sequencing is making it possible to study the antibody repertoire of an organism in 
unprecedented detail, and, by so doing, to characterize its behavior in the response to infection and in 
pathological conditions such as autoimmunity and cancer. The polymorphic nature of the repertoire poses 
unique challenges that rule out the use of many commonly used NGS methods and require tradeoffs to be 
made when considering experimental design. 

We outline the main contexts in which antibody repertoire analysis has been used, and summarize the key 
tools that are available. The humoral immune response to vaccination has been a particular focus of 
repertoire analyses, and we review the key conclusions and methods used in these studies. 

Key words Antibodies, Antibodyome, Repertoire analysis, Rep-Seq, Next generation sequencing 


1 Introduction 


The adaptive immune system embodies huge diversity, and current 
methods are not capable of determining the entire antibodyome— 
the complete set of antibodies—of a mammalian species. Neverthe¬ 
less, with high-throughput sequencing, it is now possible to sample 
at a sufficient level to gain an overview of the molecular response to 
a pathogen. This response is generally targeted towards a restricted 
set of antigens (molecules that induce an immune response). For 
example, the surface glycoproteins hemagglutinin and neuramini¬ 
dase are the main targets of antibodies directed against the influ¬ 
enza A virus. 

Vaccination, with its origins in medieval China and India, is 
probably the single most important public health measure of all 
time, and yet there are many pathogens, among them HIV, for 
which no successful vaccine has yet been developed. Even among 
well-established vaccines, some, such as those against influenza and 
tuberculosis, have limited effectiveness and breadth. By studying the 
change in antibody repertoire during a course of vaccination and
during the course of a disease, we can determine the molecular
impact of vaccination on immune memory, and compare it with
the memory elicited by the disease itself. By doing so, we can identify
the best antigens and presentations to use in vaccines, and verify their
ability to raise a lasting and broadly neutralizing response across the
population as a whole [1]. We can also improve our understanding of
the applicability and limits of animal models, which are often used in 
the surveillance of human infection as well as in the development of 
new treatments. As well as their use in the defense against external 
pathogens, antibodyome studies are used in cancer research, both to 
understand the natural immune response, and to establish how it 
could be changed or steered through vaccination, either before or 
after the cancer is established [2]. 

What information can computational repertoire analysis pro¬ 
vide in pursuit of these goals? The questions asked in a study 
typically include the following: 

- Which antibody germlines (see Subheading 2.2) are active in the 
response to a particular antigen, and how prevalent are they in 
the population as a whole? 

- What antibody clonotypes (see Subheading 2.2) are elicited, how
abundant are they, and how broad-spectrum is their response? 

- Is there evidence of convergence, with antibodies originating 
from different germlines directed at the same antigenic target? 

- Is there appropriate isotype switching (see Subheading 2.3)? 

- How long is the development pathway to a given antibody of 
interest, and what are the key development steps? 

- Is the initial response converted into long-lasting immune mem¬ 
ory (see Subheading 2.4)? 

- Are there potential obstacles to the development of an effective 
vaccine, for example undue focus on the development of non¬ 
neutralizing antibodies? 

Computational techniques are mandated by the sheer volume 
of information obtained from sequencing studies. The challenges 
imposed by the antibodyome, in particular the high degree of 
sequence polymorphism and the particular mutation characteristics 
of somatic hypermutation, have driven the development of special¬ 
ist tools, which we will highlight in this chapter. The current 
generation of tools tends to have limitations in terms of species
coverage, sequencing requirements, performance, and so on, 
meaning that a careful match must be made for a particular experi¬ 
mental analysis. 

Repertoire studies frequently bring together other sources of 
data, besides that available from next-generation sequencing. Anti¬ 
body germline libraries, such as those available from the IMGT 
databases [3], are used to determine germline ancestry. Isolated
antibodies of interest, such as those that are identified as binding to
the target antigen, are often sequenced by low-throughput methods. 
Crystallographic studies of antibody-antigen complexes may be 
available, and can be used to inform studies of clonotype evolution. 
Finally, other high-throughput tools, such as molecular mass spectrography,
may be employed to characterize particular isolates [4].


2 Background Concepts 

Here we briefly describe the key immunological concepts that are 
relevant to this work, and provide references for further informa¬ 
tion. We focus on human immunity, although the overall mechan¬ 
isms and principles are broadly applicable. A more detailed 
introduction to the topics discussed can be found in Murphy [5]. 

2.1 Antigen Recognition

Antibodies are molecules which exist both as the membrane-bound
receptors of B cells and as discrete molecules secreted by plasma
cells. The monomeric form has a Y-shaped structure, in which the
two identical arms of the Y contain the antigen receptors (Fig. 1). 
The molecule is made up of two identical heavy chains and two 
identical light chains, bound by disulfide bonds. At the N terminus 
of the four chains are the variable regions, which bind to antigen. 


Fig. 1 Antibody structure. The antibody is composed of four chains: two identical heavy chains, and two
identical light chains. The chains are joined by disulfide bonds. The variable regions of each chain, at the N
terminal end, contain the antigen receptors. The constant regions at the C terminal end contain the effectors,
which determine the antibody function (figure labels: N terminus, antigen binding, C terminus)

Light chains have in addition a single constant region, while heavy
chains have three constant regions. The C terminal constant region
of the heavy chain, CH3, contains the antibody effector, which
determines the antibody function and hence its class, or isotype.

The variable regions are each composed of three hypervariable
loops, or complementarity-determining regions, CDRs 1-3. Interspersed
with these are four framing regions, FRs 1-4. During the
development of receptor specificity, mutations occur primarily in
the CDRs, and it is mainly residues in the CDRs that make contact
with antigen, although evidence that non-CDR residues play a
crucial role is growing [6, 7].

2.2 Receptor Development

The variable region of the heavy chain is encoded by three DNA
segments. At the 5' end is the V segment, which encodes framing
regions FRs 1-3, CDRs 1 and 2, and a component of CDR3. The
next is a short D segment, which encodes a part of CDR3, and
finally the J segment encodes the 3' end of the CDR3 and FR4.
Multiple sequentially diverse copies of each segment exist in the 
germline. In an antibody-producing cell, specific V, D, and J genes 
are brought together to form a complete sequence by a process of 
gene rearrangement. A similar process is followed for the light 
chain, except that there are only two segments, V and J [8, 9].

Part of the diversity of antibodies comes from the random
combination of gene segments, as described above. Further
diversity comes from the process by which the segments come together
to form the “junction” of which CDR3 is composed. The
combination process is noisy, allowing for gene segments to be truncated
and also for additional nucleotides to be inserted. Insight can be
gained from determining the germline origin of an antibody (i.e.,
the particular segments from which it was derived), but the process
of recombination can make it difficult to determine the origin of all
segments with certainty. Because recombination involves the
insertion and deletion of nucleotides, it can lead to frame shifts. The
resulting DNA sequences are classified as productive or nonproductive,
depending on whether they can encode a functional protein.
Further development and diversity follow through somatic
hypermutation, a process through which mutations are introduced into the
variable region during transcription through the action of
activation-induced cytidine deaminase (AID). Through a process
known as affinity maturation, B cells expressing antibody with high
affinity to a target antigen are selected [10].
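
The productive/nonproductive distinction described above can be illustrated with a minimal sketch. It assumes that the rearranged nucleotide sequence is supplied starting at the first base of a codon and ending on a codon boundary; the function name `is_productive` is hypothetical, and published tools apply further checks (for example, on conserved residues) that are not shown here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(nt_seq: str) -> bool:
    """Return True if the sequence is in frame and contains no stop codon.

    Assumption: the sequence starts at codon position 1 and should end on a
    codon boundary, so a length that is not a multiple of three is treated
    as a frame shift.
    """
    seq = nt_seq.upper()
    if len(seq) % 3 != 0:
        return False  # net insertion/deletion has shifted the reading frame
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)
```

A sequence that fails either test would be reported as nonproductive; whether such sequences are discarded or analyzed separately depends on the aims of the study.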

B-cells that share a common rearrangement (and hence
descend from an identical naive B-cell) are said to share the same
clonotype. The human response to a single specific antigen is
estimated from experiment to elicit <100 clonotypes [11].
Examination of the somatic hypermutation of clonotypes can cast light on
the evolution of antigen specificity. As the phylogenetic evolution
of distinct clonotypes proceeds independently, clonotypes provide
the best basis for the understanding of maturation pathways.
However, just as germline attribution cannot in most cases be
determined conclusively, there is scope for error in the attribution of
clonotypes.

While the focus of this chapter is on B cells, it should be noted
that T cell receptor structure and development follow a similar
course, except that T cell receptors do not undergo somatic
hypermutation. Software tools developed for T cell analysis may also be
useful for B cell analysis, but may require modification to account for
the greater sequence diversity.

2.3 Isotypes

There are five main isotypes: IgA, IgD, IgE, IgG, and IgM. IgM is
the isotype produced initially by maturing B cells, and is therefore
the isotype seen first in an immune response. As well as being
present on the cellular membrane it can be secreted as a discrete
molecule, where it has a pentameric structure and is found almost
exclusively in the bloodstream. Through the process of isotype
switching, again mediated by AID, mature B cells can switch
irreversibly to another isotype [12]. The two types of most interest in
vaccine-induced repertoire studies are IgA and IgG. Both can be
secreted as discrete molecules. IgA is monomeric or dimeric, and is
the principal class in mucosal secretions. IgG is always monomeric,
and is the principal class in serum. A strong immune response will
feature class switching from IgM to an appropriate combination of
isotypes for the infection. As an example, the best response to a
respiratory infection might be expected to contain a component of
IgA; however, this could be challenging to elicit with inoculation,
which the immune system will recognize as a blood-borne
pathogen.

2.4 Immunological Memory

While antibody-secreting plasma cells generally have a brief
lifetime, estimated to range from several days to several months, a
subset migrate to survival niches in which they can survive and
continue to secrete antibodies for sustained periods that can last
for many years [13]. Memory is also preserved by memory B cells,
which can last for the lifetime of an individual, without the need for
repeated exposure to an antigen for reinforcement. Memory B cells
develop towards the end of an infection, populating the spleen and
lymph nodes and circulating at low levels in the blood [14].


3 High-Throughput Determination of Antibody Repertoires 

The total number of antibody rearrangements made possible by the
mechanisms of gene rearrangement and somatic hypermutation is
thought, in humans, to exceed 10^11. The number of unique
rearrangements in an individual at any one time is lower but still
considerable: one study estimated the number of unique B-cells
as at least 3.5 × 10^10 [15]. Often, for experimental or ethical
reasons, only a limited sample, such as a sample of peripheral
blood, is available for analysis, although it has been estimated that
only 2% of B-cell diversity is present in peripheral blood [16]. Even
where the entire organism is available, the diversity in mammals is
several orders of magnitude above our current sequencing
capabilities, and the cautions of working with small samples apply.

High-throughput sequencing of antibody repertoires, which
has become known as Rep-Seq [17, 18], starts with the isolation
of genomic DNA (gDNA) or messenger RNA (mRNA) from cells
of interest. mRNA is considered to be more informative of the
repertoire, as apparently viable gDNA sequences may be
nonfunctional as a result of monoallelic gene expression or other
mechanisms, but as levels of mRNA may vary from cell to cell, there is a risk
that mRNA read counts will not correlate well with cell
populations. However, a recent study did find high correlation between
functional gDNA and mRNA sequence frequencies in human
peripheral blood samples [19].

V-region mRNA is typically isolated for sequencing by nested
RT-PCR. Multiplexed primers are required because of the degree
of polymorphism. For full V-region amplification, 3' primers are
usually selected to be complementary to a section of the constant
region, making them independent of the variable region germline.
5' primers, on the other hand, are often complementary to a section
of the V-region FR1, and a set of V-gene germline-dependent
primers must therefore be chosen. This raises the possibility of
germline-dependent amplification bias, which must be checked
for if sequence counts are used to infer germline abundance, for
example by comparing V-germline abundance inferred from PCR/
NGS data with that inferred by single-cell analysis, or by checking
for correlation in gene utilization calculations derived from two
independent primer sets. An alternative approach to PCR
amplification is to use 5' RACE (rapid amplification of cDNA ends), which
does not require a 5' primer; however, RACE protocols are
susceptible to nonspecific amplification and can have low efficiency.

Because of the high diversity of V-gene sequences, exacerbated
by the concentration of that diversity into short CDRs separated by
relatively constant FRs, it is not possible to employ sequence
assembly to join the short reads typically generated by current
high-throughput sequencers. The sequencers typically employed
for Rep-Seq are the Roche 454, the Illumina MiSeq, and the
Illumina HiSeq. These are capable of covering the entire V-gene
(Table 1), although the HiSeq only acquired this capability in late
2014, and many HiSeq-based studies have been limited to a specific
region, typically the CDR3. New sequencing technologies that will
provide increased depth of coverage at reduced cost are under
development, but the overall tradeoff between entire V-gene
coverage on the one hand and maximum depth of coverage on the
other is likely to persist for some time.

Table 1
Characteristics of sequencers typically employed for Rep-Seq studies

Sequencer         Read length (nt)   Max. reads per sample   Approx. per-base error rate
Roche 454         700-1000           ~1 × 10^6               ~10^-4
Illumina MiSeq    600 (2 × 300)      ~2.5 × 10^6             ~10^-4
Illumina HiSeq    500 (2 × 250)      ~3 × 10^8               ~10^-3

The V-gene can extend in some cases to >400 nt in length. With suitable PCR primers, the sequencers listed can
sequence entire V-genes. The number of reads per sample is taken from manufacturers’ data sheets, and numbers
obtained in practice are often an order of magnitude lower

Because of the high degree of polymorphism in V-regions,
separating sequencing read errors and PCR amplification errors
from genuine diversity presents a challenge. An approach taken in
many studies is to eliminate from consideration all V-sequences
which are only observed once, on the assumption that a repeated
error at the same location will be rare. Greater discrimination can be
gained by adding a randomized barcode to each RNA molecule
prior to PCR amplification. All reads with the same barcode should
share the same sequence, and the error-free sequence can therefore
be determined by consensus [20, 21]. Sequence abundance can be
determined from unique barcodes, removing the risk of PCR
amplification bias. Three analysis pipelines that can process
barcoded reads have been published: Migec [22], Presto [23] and
IgRepertoireConstructor [24]. Migec and IgRepertoireConstructor
also have the capability to parse the V(D)J junction (see next
section).
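
The barcode consensus idea can be sketched as follows. This is an illustration only, not the method used by Migec, Presto, or IgRepertoireConstructor: it assumes that reads arrive as (barcode, sequence) pairs, that reads sharing a barcode can be compared column by column without alignment, and that a simple per-position majority vote is adequate. The function name and the `min_reads` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_reads=2):
    """Collapse (barcode, sequence) pairs into one consensus per barcode.

    Barcodes supported by fewer than `min_reads` reads are dropped, because
    a consensus cannot separate error from signal in a single read.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        # Keep reads of the most common length so that columns line up
        # without an alignment step (a simplifying assumption).
        length = Counter(len(s) for s in seqs).most_common(1)[0][0]
        same_len = [s for s in seqs if len(s) == length]
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*same_len)
        )
    return consensus
```

Under these assumptions, the number of keys in the returned dictionary counts distinct starting molecules, which is the quantity used in place of raw read counts when estimating abundance.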

The heavy and light chains of an antibody are joined by
disulfide bonds, and the sequencing techniques discussed above are not
capable of determining which heavy and light chains are paired in
particular antibodies. To determine this pairing, mRNA or gDNA
products from individual cells must be isolated and identified prior
to sequencing. A number of enhanced throughput techniques have
been developed for this [25-27] but these techniques remain
highly specialized and none have been reported that will determine
the pairing at the high volumes at which sequencing is possible. In
the absence of a method to determine the actual pairing, likely
pairing of functionally active antibodies may be inferred via
combinatorial phage display in which the chain of interest is paired with
many possible chains derived from the sample [28].


4 Sequence Analysis and Inference of Germline Ancestry

Once sequences have been obtained, the next step will usually be to
determine the boundaries of the complementarity-determining and
framework regions, by reference to the antibody germline
sequences of the organism concerned. Determining the sequences
and boundaries of CDR1 and CDR2, and the framework regions
adjacent to them, is relatively straightforward as the entirety of
these regions is coded in the V-gene. Assuming that a reference
set of germline V-gene sequences is available, and prealigned
against a reference numbering scheme such as the IMGT
numbering scheme [29], the sequence of interest can
be aligned against the closest-matching germline, and its
constituent regions determined from the alignment. While this approach has
been used successfully in many analyses, and forms the backbone of
the online analysis tools described later in this section, it should be
noted that V-sequences may be formed by an alternate process of
gene conversion, in which sections of the ancestral V-gene are
replaced by sections from alternate genes. This form of
rearrangement has been observed most notably in the rabbit, but can also
occur in other organisms including humans [30-32]. In B-cells,
affinity maturation through the mechanism of somatic
hypermutation also gives rise in some cases to highly diverged V-sequences
where the germline ancestry may not be readily deduced.
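
As a rough sketch of this procedure, the fragment below selects the closest germline by simple identity and then reads off the V-encoded regions from the gapped germline. Several assumptions are made for illustration: sequences are amino acid strings, germlines are supplied already gapped (with "." as the gap character) so that each character occupies one IMGT position, the query differs from its germline by substitutions only, and the region boundaries are those of the IMGT unique numbering for V-domains (FR1 1-26, CDR1 27-38, FR2 39-55, CDR2 56-65, FR3 66-104). All function names are hypothetical, and real tools use a proper alignment rather than positional comparison.

```python
# IMGT unique numbering boundaries for the V-encoded regions (1-based, inclusive).
IMGT_V_REGIONS = {
    "FR1": (1, 26), "CDR1": (27, 38), "FR2": (39, 55),
    "CDR2": (56, 65), "FR3": (66, 104),
}


def identity(query, gapped_germline):
    """Fraction of positions at which the query matches the ungapped germline."""
    germline = gapped_germline.replace(".", "")
    matches = sum(q == g for q, g in zip(query, germline))
    return matches / max(len(query), len(germline))


def closest_germline(query, germlines):
    """Name of the germline (dict of name -> gapped sequence) with highest identity."""
    return max(germlines, key=lambda name: identity(query, germlines[name]))


def transfer_gaps(query, gapped_germline):
    """Place the query on the IMGT numbering by copying the germline's gaps."""
    residues = iter(query)
    return "".join("." if g == "." else next(residues, ".") for g in gapped_germline)


def split_regions(gapped_query):
    """Slice a gapped, IMGT-numbered sequence into FR and CDR strings."""
    return {
        name: gapped_query[start - 1:end].replace(".", "")
        for name, (start, end) in IMGT_V_REGIONS.items()
    }
```

Sequences affected by gene conversion or heavy somatic hypermutation, as discussed above, are exactly the cases in which such a positional comparison is least reliable.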

A similar analysis of the V(D)J junction, based on the ancestry
of the constituent genes, is more challenging for a number of
reasons. The relatively small size of the D- and J-genes can make a
definitive determination of ancestry difficult. The insertion and
deletion of nucleotides at the junction between segments can
make it difficult or impossible to infer with certainty exactly
which nucleotides at the boundary were attributable to which
gene. In a small number of cases (estimated at ~0.5% of
recombinations in humans), two D-genes can combine in tandem to form
a V-D-D-J junction [33]. The overall approach adopted by the
majority of published tools relies on the presence of a conserved
cysteine at the 5' end of the junction and a conserved residue at the
3' end: tryptophan for heavy chains and phenylalanine for light
chains and T-cell receptor chains. In this approach, the algorithm
first considers nucleotides at the 5' end, starting from the conserved
cysteine, comparing them to the nucleotides expected for the
germline V-gene. The V-gene boundary is identified at the point
that nucleotide divergence from the germline exceeds a defined
threshold. The same process is then initiated at the 3' end, working
back through the junction and comparing observed
nucleotides against those of the parent J-gene in order to determine the
J-gene boundary. The residual sequence lying between the V-gene
and J-gene boundaries is then scored against all possible D-genes.
Finally, heuristics are applied to determine inserted palindromic
sequences and inserted N-nucleotides [34, 35]. Another important
component of junction analysis is the identification of
nonproductive recombinations, containing frame shifts or stop codons.
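
The boundary search at the 5' end can be sketched as below. The divergence criterion used here (a cumulative mismatch count, with the hypothetical parameter `max_mismatches`) is an assumption made for illustration; published tools define their thresholds differently. Both input strings are assumed to begin at the first nucleotide of the conserved cysteine codon.

```python
def v_gene_boundary(junction, germline_v_tail, max_mismatches=1):
    """Number of leading junction nucleotides attributed to the V-gene.

    The scan walks 3'-ward from the conserved cysteine and stops once the
    cumulative mismatch count exceeds `max_mismatches`. The boundary is
    placed just after the last matching position seen before that point.
    """
    mismatches = 0
    boundary = 0
    for i, (observed, expected) in enumerate(zip(junction, germline_v_tail)):
        if observed == expected:
            boundary = i + 1
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
    return boundary
```

Under the same assumptions, the J-gene boundary could be located by applying the function to the reversed junction and the reversed 5' end of the germline J-gene, leaving the residual sequence between the two boundaries to be scored against candidate D-genes.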

Where it is not critical to establish the germline ancestry of the
V(D)J region, the sequence can be determined using
pattern-matching methods that take advantage of the conserved residues.
An advantage of this approach is that, because it does not rely on
the determination of V-gene ancestry, it can be applied to short
reads that do not extend substantially into the V-region [36, 37].
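
A minimal pattern-matching sketch is shown below. It works on a translated amino acid sequence and anchors on the conserved cysteine and the conserved tryptophan or phenylalanine. Two details are assumptions not taken from the tools cited above: the requirement that the closing residue be followed by a Gly-X-Gly motif, and the permitted junction length of 3 to 35 intervening residues.

```python
import re

# Conserved Cys, 3-35 residues, then Trp/Phe that is followed by Gly-X-Gly.
# The lookahead checks the motif without including it in the match.
JUNCTION_PATTERN = re.compile(r"C[A-Z]{3,35}?[FW](?=G[A-Z]G)")


def find_junction(aa_seq):
    """Return the first junction-like substring, or None if no match is found."""
    match = JUNCTION_PATTERN.search(aa_seq.upper())
    return match.group(0) if match else None
```

Because the pattern needs only the residues flanking the junction, it can be applied to reads that cover little of the V-region, at the cost of providing no germline assignment.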

Both online tools and downloadable analysis tools are available
(Table 2). The most frequently cited online tool for high-volume
analysis is IMGT/High V-Quest [38]. While High V-Quest is
supported by germline libraries for a growing number of species,
it is not possible to use the tool with a customized or user-supplied
germline library, limiting its use in certain applications. It is only
available as an online service, and analysis completion times are
dependent on system load. Another high-volume online service is
NCBI IgBLAST [39], which employs the BLAST algorithm to
determine germline matches. IgBLAST does support the use of customized
germline libraries, and is available as a standalone program to run
locally. IgSCUEAL [40], available both online and to run locally,
takes a phylogenetic approach to germline assignment. It is hosted
by the HyPhy genetic sequence analysis program [41]. While it is
scalable, the approach is relatively compute intensive, requiring a
high-performance compute cluster for high-volume analyses.
IgRepertoireConstructor [24] is an open-source package for the
construction of repertoires from Illumina sequence sets. It incorporates
a novel approach to sequence error correction, and supports
barcoded reads and the integration of proteomics analyses.

Table 2
Software tools for analysis of the V(D)J junction and associated ancestry

Tool                            Online version?  Local version?  Source code?  Custom germlines?
IMGT V-Quest, High V-Quest      Yes              No              No            No
IgBLAST                         Yes              Yes             No            Yes
iHMMune-align [42]              Yes              Yes             Yes           Yes
Ab-origin [43]                  No               Yes             No            Yes
JOINSOLVER [44]                 Yes              No              No            No
SoDA2 [45]                      Yes              No              No            No
VDJSolver [46]                  Yes              Yes             Yes           Yes
Vidjil [47]                     No               Yes             Yes           No
ARPP [48]                       No               Yes             No            No
Decombinator [49]               No               Yes             Yes           No
Migec [22]                      No               Yes             Yes           No
MiTCR [50]                      No               Yes             Yes           No
VDJ [28]                        No               Yes             Yes           No
VDJFasta [15] (a)               No               Yes             Yes           No
IgSCUEAL [40]                   Yes              Yes             Yes           Yes
IgRepertoireConstructor [24]    No               Yes             Yes           Yes

Tools which do not allow a custom germline library to be defined through the user interface are indicated: these tools are
provided with a built-in library and the ease with which it could be replaced has not been assessed. (a) VDJFasta is available
at http://sourceforge.net/vdjfasta. The download location of other tools is documented in the referenced article

A number of approaches have been taken to identify clonotypes
from high-volume sequencing data [51]. Strictly speaking, one
would wish to establish that the V, D, and J genes of clonotypes
have identical ancestry, and that they share a common pattern of
N-insertions; however, there is uncertainty in each of these
determinations, particularly where the sequence is highly diverged. A starting
point is to cluster sequences with the same inferred V, D, and J
germline ancestry. Subsequent clonotype clustering may then
proceed on the basis of amino acid similarity [52] or nucleotide
similarity [53]. While the latter approach appears a priori to conform
more closely to the underlying mechanism of rearrangement, and in
particular is more likely to respond to differences in “N” insertions,
our own experience is that the two approaches provide broadly
similar results, with both yielding well-differentiated clusters. A
somewhat different approach is described by Giraud et al. [47]
and implemented in Vidjil (see Table 2): in this approach, junction
similarity is determined heuristically without full germline analysis.
The Immunoglobulin Analysis Tool (IgAT) is a Microsoft Excel-
based tool which provides extended analysis of IMGT results sets,
including clonotype analysis [54].
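
The two-stage procedure (grouping by inferred germline ancestry, then clustering by nucleotide similarity) can be sketched as follows. The details are assumptions chosen for illustration rather than the method of any cited study: junctions are compared only when they have the same length, similarity is measured by Hamming distance, clusters are formed by single linkage, and the threshold `max_diff_fraction` is hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_clonotypes(records, max_diff_fraction=0.1):
    """Group (seq_id, v, d, j, junction_nt) records into putative clonotypes."""
    # Stage 1: same inferred V, D, and J genes and same junction length.
    groups = defaultdict(list)
    for seq_id, v, d, j, junction in records:
        groups[(v, d, j, len(junction))].append((seq_id, junction))

    # Stage 2: single-linkage clustering on junction nucleotide similarity.
    clonotypes = []
    for members in groups.values():
        unassigned = list(members)
        while unassigned:
            seed = unassigned.pop()
            cluster, frontier = [seed], [seed]
            while frontier:
                _, current = frontier.pop()
                limit = max_diff_fraction * len(current)
                near = [m for m in unassigned if hamming(current, m[1]) <= limit]
                unassigned = [m for m in unassigned if m not in near]
                cluster.extend(near)
                frontier.extend(near)
            clonotypes.append(sorted(seq_id for seq_id, _ in cluster))
    return clonotypes
```

Replacing the nucleotide junction with its translation would give the amino acid variant of the same scheme; the comparison is quadratic within each group, so a large dataset would need a more efficient neighbor search.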

Selective pressure in V-region development can be determined
by comparing the ratio of observed non-silent to silent mutations;
however, there are specific biases in sequence specificity and base
substitution in somatic hypermutation [55]. BASELINe is an
online tool, also available as public domain software, which
calculates the selective pressure in CDR and framework regions. It can be
used, for example, to compare selective pressure in different regions
and in different isotypes [56, 57].
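
A naive count of non-silent (replacement) and silent changes is sketched below. It is not the BASELINe method and makes no correction for the hypermutation biases noted above; it assumes two in-frame, equal-length sequences containing only the upper-case bases A, C, G, and T, and it counts changed codons rather than individual nucleotide changes. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_replacement_silent(germline, observed):
    """Return (replacement, silent) counts of changed codons."""
    replacement = silent = 0
    for i in range(0, min(len(germline), len(observed)) - 2, 3):
        g_codon, o_codon = germline[i:i + 3], observed[i:i + 3]
        if g_codon == o_codon:
            continue
        if CODON_TABLE[g_codon] == CODON_TABLE[o_codon]:
            silent += 1  # codon changed, amino acid unchanged
        else:
            replacement += 1  # codon change alters the amino acid
    return replacement, silent
```

Applied separately to the CDR and framework portions of a sequence, the two counts give the raw ratios that a bias-aware method would then compare against an expected background.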


5 Monitoring the Humoral Immune Response to Vaccination 

High-volume sequencing studies have been conducted with the
aim of examining the B-cell response to vaccination or infection,
and in particular to understand the process and pathways of somatic
hypermutation. This is driven in large part by interest in the
development of vaccines that will elicit broad-spectrum responses
to highly polymorphic viruses such as influenza and HIV. It is of
particular interest in HIV, where the germline ancestors of
broad-spectrum antibodies are found to have little or no reactivity,
suggesting that their elicitation may require a particular development
path to be followed involving exposure to multiple antigens [58].
In this section, we describe the analyses that are typically conducted
in these studies, and discuss their relevance to the overall problem.

A number of Rep-Seq studies have confirmed that the B-cell
response to viral antigens is associated with clonal expansion and
isotype switching from IgM to IgA/IgG [20, 59-61]. Clonal
expansion has been observed also in patients with cancer [62]. A
subset of the plasmablasts generated during peak response will
survive as long-lived plasma cells. Booster vaccination studies with
influenza vaccine and tetanus toxoid vaccine have found peak
plasmablast production occurring 6-7 days after vaccination [60, 63].
Clonal analysis of samples taken at multiple time points, one around
the peak period of plasmablast production, and one or more at
times several weeks or months into the future, will facilitate the
understanding of these two processes of rapid diversification and
incorporation into long-term immune memory. An understanding
of relative IgM levels at the different timepoints may be helpful in
understanding the extent to which the response is based on the
recall of immune memory [61].

Phylogenetic analysis can be used to characterize the
development of antibodies of interest, but the combined effects of antibody
recombination and NGS sequence read errors make this
challenging. Specific tools have been developed for the analysis of variable
region ancestry and inference of intermediates [48, 64-66].
Phylogenies that depict descent from a specific V-gene germline should
be treated with care, as they are likely to combine descents from
multiple V(D)J recombinations. Within specific clonotypes,
sequence logo diagrams [67] provide a depiction that illustrates
the residues explored by the clonotype, and the potential key
residues that remain conserved.

Identity/divergence plots (Fig. 2) (also known as divergence-
mutation scatter plots) have been used by a number of authors to
investigate the relationship between an antibody of interest
(typically an antibody known to be broadly or strongly neutralizing) and
its germline, together with other antibodies that have the same
germline ancestor [65, 68]. These plots tend to show a small
number of sequences that are similar to the target antibody, and a
large mass of sequences that are not. It is worth noting that this
large mass of sequences is itself quite diverse, and this can be
illustrated by coloring the points by clonotype [65], an approach
that will also draw out possible convergence between clonotypes
towards the target antibody.

Fig. 2 Example of an identity/divergence plot, in which V-gene sequences matching the germline of an
antibody of interest (marked in red) are plotted in terms of their identity to the antibody and their divergence
from the germline (marked with a blue cross). As over 200,000 sequences are represented in this plot, they
are represented by means of a contour plot indicating density
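
The coordinates used in an identity/divergence plot can be computed as in the following sketch. It assumes that all sequences have been prealigned to a common length, so that positions can be compared directly, and that both axes are expressed as percentages; the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def identity_divergence(sequences, antibody, germline):
    """Return one (identity to antibody, divergence from germline) pair per sequence."""
    return [
        (percent_identity(seq, antibody), 100.0 - percent_identity(seq, germline))
        for seq in sequences
    ]
```

The resulting pairs can be passed to any plotting library; with several hundred thousand points, a density or contour representation, as in Fig. 2, is likely to be more legible than a scatter plot.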


Chapter 16


Using the QAPgrid Visualization Approach for Biomarker 
Identification of Cell-Specific Transcriptomic Signatures 

Chloe Warren, Mario Inostroza-Ponta, and Pablo Moscato 

Abstract 

In this chapter, we illustrate the use of an integrated mathematical method for joint clustering and 
visualization of large-scale datasets. In applying these clustering methodologies to biological datasets, we 
aim to identify differentially expressed genes according to cell type by building molecular signatures 
supported by statistical scores. In doing so, we also aim to find a global map of highly co-expressed clusters. 
Variations in these clusters may well indicate other pathological trends and changes. 

Key words Clustering, Visualization, Neuroscience, Genetics 


1 Introduction 


We have previously developed these methods for the analysis of
gene expression microarray (GEM) datasets [1], such as those
produced by studies that seek to reverse engineer the functional
genomics of model organisms like Saccharomyces cerevisiae [2]. We
have also applied the methods to the whole-genome analysis of
RNA stability [3] and we have created an interactive map based
on the results. In addition, our method has been used to
generate a successful redefinition of clinical cases of proctitis
syndrome [4]. Interestingly, in all these cases, we have dealt with
situations in which the values under consideration (gene
expression, symptom measurements) change over time. This has led us
to evaluate this method as a potential tool to provide new insights
into our understanding of coherent patterns of co-expression
during the progression of neurodegeneration, which occurs in a
number of diseases including Alzheimer’s disease, Parkinson’s disease, and
multiple sclerosis.

However, preliminary investigation of some of the GEM
datasets associated with these diseases has led us to prioritize the work
we present here in our research agenda. This work came about as a
consequence of one of the shortcomings of GEM technologies.
Namely, the biological samples used to generate GEM data can be
comprised of whole tissue, meaning that there are multiple cell
types present within any one sample. Such sample heterogeneity
can complicate the functional biological interpretation of data, as
well as the identification of biomarkers [5-7]. We have therefore been
inspired to generate methodologies to identify clusters of gene
expression signatures that define different cell types.

With this contextual information, we introduce the QAPgrid
integrated clustering and visualization approach in the field of
identification of neurological cell-specific transcriptomes. We use
a GEM data set of distinct cell types (neurons, astrocytes, and
oligodendrocytes) taken from mouse brains [8]. As previously
mentioned, sample heterogeneity can be a major hindrance to the
interpretation of biological data. In particular, the brain is
renowned for its complex cellular composition. As such, these
data were collected using novel cell separation techniques in order
to provide a resource for improved understanding of the
development, physiology and pathology of the brain [8].

We envisage that further development of these data
visualization and analysis methods will lead to the production of a
multidimensional resource, wherein graphical representation of expression
clusters will be hyperlinked to specific expression data as well as
online gene function information.


2 Methodology

The GEM data used in these investigations are publicly available in
the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/),
via the accession number GSE9566. The
data were gathered using Affymetrix GeneChip Arrays, and
represent three different cell types (neurons, astrocytes, and
oligodendrocytes) which were acutely isolated from the forebrains of healthy
mice. The mice’s ages varied between 7 and 17 days. The data were
annotated with gene names and symbols using SOURCE
(http://smd.stanford.edu/cgi-bin/source/sourceSearch), and normalized
such that the sum of all the probe set values in a sample was equal
to 1.

2.1 Well Described Biomarker Gene Lists

The lists of well described biomarker genes (see Tables 1, 2, and 3)
were obtained from the original publication of the data [8].
Significantly, these markers were selected on the basis of previous
studies (for the full list of references, see Table 4) rather than on
the data itself; the selection is therefore unbiased with respect to
the data. It should be noted
that, in GEMs, genes are often represented by multiple probe sets,
which are comprised of 25-base sequences designed to hybridize to
various points within the target gene [9]. A common issue with
GEM interpretation is that probe sets for the same gene often have
different expression patterns [9, 10]. There are multiple
explanations for this: probe sets can hybridize to splice variants of a gene, or
can inadvertently hybridize nonspecifically to a different (i.e.,
wrong) gene due to sequence similarity [9]. In an attempt to
limit the effect this issue has on our analysis of the effectiveness of
our methods, we have discounted probe sets whose expression is
non-cell specific from the lists of well described biomarkers, as seen
in Tables 1, 2, and 3.

Table 1
Well described biomarker probe sets for astrocytes, and their location in super clusters and clusters,
as specified by application of the MST-kNN unsupervised clustering algorithm to a GEM dataset of pooled
cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains

Gene name                                                                      Gene symbol  Probe set ID   Super cluster  Cluster
Solute carrier family 1 (glial high affinity glutamate transporter), member 2  Slc1a2       1438194_at     1              445
                                                                                            1459014_at     1              254
                                                                                            1433094_at     1              445
                                                                                            1457800_at     1              445
                                                                                            1439940_at     1              445
                                                                                            1451627_a_at   1              445
Gap junction protein, beta 6                                                   Gjb6         1448397_at     1              254
Glial fibrillary acidic protein                                                Gfap         1440142_s_at   1              254
                                                                                            1426509_s_at   1              254
                                                                                            1426508_at     1              254
Solute carrier family 1 (glial high affinity glutamate transporter), member 3  Slc1a3       1426341_at     1              254
                                                                                            1439072_at     1              445
                                                                                            1440491_at     1              254
                                                                                            1452031_at     1              254
                                                                                            1443749_x_at   1              254
                                                                                            1426340_at     1              254
Aquaporin 4                                                                    Aqp4         1447745_at     1              254
                                                                                            1425382_a_at   1              254
                                                                                            1434449_at     1              254
Aldolase C, fructose-bisphosphate                                              Aldoc        1424714_at     1              254
                                                                                            1451461_a_at   1              559
Fibroblast growth factor receptor 3                                            Fgfr3        1421841_at     1              254
                                                                                            1425796_a_at   1              445

2.2 Clustering Clustering algorithms applied to gene expression data aim to find 

groups of related probe sets that share common characteristics, like 
relative expression values or expression profiles across a sample set. 
These common characteristics are usually measured using either a 
similarity or distance metric. The most common of such measures 
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Table 2 

Well described biomarker probe sets for oligodendrocytes, and their location in super clusters and 
clusters, as specified by application of MST-kNN unsupervised clustering algorithm to a GEM dataset 
of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains 


Gene name 

Gene 

symbol 

Probe sets ID 

Super 

cluster 

Cluster 

SRY-box containing gene 10 

SoxlO 

1451689_a_at 

1 

1 



142498 5_a_at 

1 

1 

Platelet derived growth factor receptor, alpha 

Pdgfra 

1421916_at 

1 

1 

polypeptide 





Gap junction protein, gamma 2 

Gjc2 

1450483_at 

1 

1 



1435214_at 

1 

1 

Myelin basic protein 

Mbp 

1451961_a_at 

1 

1 



1419646_a_at 

1 

11 



143620 l_x_at 

1 

11 



145465 l_x_at 

1 

11 



1456228_x_at 

1 

11 



1433532_a_at 

1 

11 



1425263_a_at 

0 

5 

Myelin oligodendrocyte glycoprotein 

Mog 

1448768_at 

1 

1 

Myelin-associated oligodendrocytic basic protein 

Mobp 

1433785_at 

1 

11 



1421010_at 

1 

1 



1436263_at 

1 

1 



1450088_a_at 

1 

1 

UDP galactosyltransferase 8A 

Ugt8a 

1419064_a_at 

1 

1 



1419063_at 

1 

1 

Galactose- 3 -O-sulfo transferase 1 

Gal3stl 

1454078_a_at 

1 

1 

Gal3stl 





Myelin-associated glycoprotein 

Mag 

1460219_at 

1 

1 

Myelin and lymphocyte protein, T cell 

Mai 

1432558_a_at 

0 

0 

differentiation protein 


1417275_at 

1 

26 


are the Euclidean-based metric and correlation-based metrics. The choice of metric is a key step, as (1) it defines when two probe sets are going to be considered similar and (2) it has to be relevant for the questions being asked (for instance, the use of robust correlation metrics may be necessary for some problem domains). In order to analyze GEM data and find groups of related probe sets, we use the unsupervised graph-based clustering algorithm MSTkNN [11, 12]. We use the Pearson correlation distance metric, as shown in Eq. (1), where $\rho_{xy}$ is the Pearson correlation coefficient between the expression profiles of probe sets x and y. This metric defines a distance between 0 (perfectly correlated probe sets) and 2 (perfectly anti-correlated probe sets), with uncorrelated probe sets at a distance of 1. It has the advantage of not being sensitive to the amount of expression, and being able to identify similar expression profiles between two probe sets.

$$d_{xy} = 1 - \rho_{xy} \qquad (1)$$

Table 3

Well described biomarker probe sets for neurons, and their location in super clusters and clusters, as specified by application of the MSTkNN unsupervised clustering algorithm to a GEM dataset of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains

| Gene name | Gene symbol | Probe set ID | Super cluster | Cluster |
|---|---|---|---|---|
| Neurofilament, light polypeptide | Nefl | 1426255_at | 8 | 561 |
| | | 1454672_at | 8 | 190 |
| Gamma-aminobutyric acid (GABA) A receptor, subunit alpha 1 | Gabra1 | 1455766_at | 8 | 190 |
| | | 1436889_at | 2 | 472 |
| | | 1421281_at | 8 | 190 |
| Synaptotagmin I | Syt1 | 1421990_at | 8 | 190 |
| | | 1431191_a_at | 8 | 561 |
| | | 1433884_at | 2 | 499 |
| Solute carrier family 12, member 5 | Slc12a5 | 1451674_at | 2 | 499 |
| Synaptosomal-associated protein 25 | Snap25 | 1416828_at | 8 | 190 |
| Synaptic vesicle glycoprotein 2 b | Sv2b | 1434800_at | 8 | 190 |
| | | 1435687_at | 8 | 531 |
| Potassium voltage-gated channel, subfamily Q, member 2 | Kcnq2 | 1451595_a_at | 2 | 152 |
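As a minimal sketch of the Pearson correlation distance in Eq. (1), the following pure-Python function may help to see why the metric ignores the overall amount of expression. The function name and the toy profiles are illustrative assumptions, not part of the original analysis.

```python
import math

def pearson_distance(x, y):
    """Return 1 - rho, where rho is the Pearson correlation of profiles x and y.

    Assumes both profiles have the same length and are not constant
    (a constant profile has zero variance and would divide by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    rho = cov / math.sqrt(var_x * var_y)
    return 1.0 - rho

# Hypothetical expression profiles over four samples.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]   # same shape, ten times the expression level
mirrored = [4.0, 3.0, 2.0, 1.0]   # opposite trend

print(pearson_distance(low, high))      # close to 0: perfectly correlated
print(pearson_distance(low, mirrored))  # close to 2: perfectly anti-correlated
```

Two profiles with the same shape but very different expression levels are therefore at distance 0, which is the property the text relies on.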

The MSTkNN clustering algorithm is based on the use of two proximity graphs, namely the Minimum Spanning Tree (MST) and the k-Nearest Neighbor graph (kNN). The MST is a connected acyclic graph with the smallest sum of edge weights, and in the kNN graph two vertices are connected if at least one of them is among the k closest neighbors of the other according to the edge weights. The algorithm first builds a complete graph $G(V, E, W)$, with a vertex $v$ for each of the probe sets and an edge $e_{xy}$ for each pair of probe sets $(x, y)$, with the edge's weight being the distance between the expression profiles of the two probe sets. First, the algorithm computes the Minimum Spanning Tree $G_{MST}(V, E_{MST}, W_{MST})$, where $E_{MST}$ and $W_{MST}$ are subsets of $E$ and $W$, respectively. Second, the algorithm computes the k-Nearest Neighbor graph $G_{kNN}(V, E_{kNN}, W_{kNN})$, where again $E_{kNN}$ and $W_{kNN}$ are subsets of $E$ and $W$, respectively. The number of nearest neighbors to be considered is automatically computed using Eq. (2) (see below). Then, the algorithm computes the intersection of the edge sets of both graphs ($E_{MST} \cap E_{kNN}$), which will produce a partition of the graph into $c \geq 1$ connected components. If $c = 1$, then the algorithm stops. Otherwise ($c > 1$), the algorithm is recursively applied to each of the connected components. Algorithm 1 shows the pseudocode of the clustering algorithm. It receives a distance matrix and returns a graph with $c \geq 1$ connected components. A schematic representation of the algorithm is shown in Fig. 1.

$$k = \min\{\ln(n),\ \min\{k \mid G_{kNN} \text{ is connected}\}\} \qquad (2)$$

Here $n$ is the number of probe sets in the graph being clustered. In the first pass, the algorithm uses a relatively large number of nearest neighbors to find clusters; in the recursive calls it uses a smaller number, depending on the number of probe sets in each component. It is also possible to restrict the number of nearest neighbors to a $k_{max}$ by adding it to Eq. (2): $k = \min\{\ln(n),\ \min\{k \mid G_{kNN} \text{ is connected}\},\ k_{max}\}$. This condition forces the algorithm to consider only up to the $k_{max}$ largest similarities (or smallest distances).

Table 4

Literature references for well described marker genes of neurons, oligodendrocytes, and astrocytes

| Cell type | Well described marker | Symbol | Reference |
|---|---|---|---|
| Neuron | Neurofilament, light polypeptide | Nefl | [19] |
| | Gamma-aminobutyric acid (GABA) A receptor, subunit alpha 1 | Gabra1 | [20] |
| | Synaptotagmin I | Syt1 | [20] |
| | Solute carrier family 12, member 5 | Slc12a5 | [20] |
| | Synaptosomal-associated protein 25 | Snap25 | [21] |
| | Potassium voltage-gated channel, subfamily Q, member 2 | Kcnq2 | [22] |
| | Synaptic vesicle glycoprotein 2 b | Sv2b | [23] |
| Oligodendrocyte | Chondroitin sulfate proteoglycan 4 | Cspg4 | [24] |
| | SRY-box containing gene 10 | Sox10 | [25, 26] |
| | Platelet derived growth factor receptor, alpha polypeptide | Pdgfra | [20, 27] |
| | Gap junction protein, gamma 2 | Gjc2 | [28] |
| | Myelin basic protein | Mbp | [29, 30] |
| | Myelin oligodendrocyte glycoprotein | Mog | [30] |
| | UDP galactosyltransferase 8A | Ugt8a | [31] |
| | Galactose-3-O-sulfotransferase 1 (cerebroside sulfotransferase) | Gal3st1 | [32] |
| | Myelin-associated oligodendrocytic basic protein | Mobp | [33] |
| | Myelin-associated glycoprotein | Mag | [30] |
| | Myelin and lymphocyte protein, T cell differentiation protein | Mal | [34] |
| Astrocyte | Solute carrier family 1 (glial high affinity glutamate transporter), member 2 | Slc1a2 | [35] |
| | Gap junction protein, beta 6 | Gjb6 | [36] |
| | Glial fibrillary acidic protein | Gfap | [29, 37, 38] |
| | Solute carrier family 1 (glial high affinity glutamate transporter), member 3 | Slc1a3 | [39] |
| | Aquaporin 4 | Aqp4 | [40] |
| | Aldolase C, fructose-bisphosphate | Aldoc | [41] |
| | Fibroblast growth factor receptor 3 | Fgfr3 | [42] |





Algorithm 1

MSTkNN clustering algorithm. Function connectedComponents(G) returns the number of connected components in G. Function submatrix(D, $G^{i}_{CLUSTER}$) gets the distance matrix of the elements of the i-th connected component in $G_{CLUSTER}$

MSTkNN(D: distance matrix)

1. Compute $G$

2. Compute $G_{MST}$

3. Compute $G_{kNN}$, with $k = \min\{\ln(n),\ \min\{k \mid G_{kNN} \text{ is connected}\}\}$

4. $G_{CLUSTER} = (V_{CLUSTER} = V,\ E_{CLUSTER} = E_{MST} \cap E_{kNN})$

5. c = connectedComponents($G_{CLUSTER}$)

6. IF (c > 1) THEN

7. $G_{CLUSTER} = \bigcup_{i=1}^{c}$ MSTkNN(submatrix(D, $G^{i}_{CLUSTER}$))

8. ENDIF

9. RETURN $G_{CLUSTER}$

END MSTkNN
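The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the helper names are hypothetical, ln(n) is rounded down to an integer with a minimum of 1 (an assumption, since Eq. (2) does not state the rounding), ties between equally distant neighbors are broken by index order, and clusters are returned as lists of indices instead of a graph.

```python
import math

def mst_edges(D, nodes):
    """Prim's algorithm on the complete graph over `nodes`."""
    best = {v: (D[nodes[0]][v], nodes[0]) for v in nodes[1:]}
    edges = set()
    while best:
        v = min(best, key=lambda u: best[u][0])   # closest vertex outside the tree
        _, parent = best.pop(v)
        edges.add(frozenset((parent, v)))
        for u in best:                            # relax distances through v
            if D[v][u] < best[u][0]:
                best[u] = (D[v][u], v)
    return edges

def knn_edges(D, nodes, k):
    """Edge (v, u) exists if u is among the k nearest neighbors of v, or vice versa."""
    edges = set()
    for v in nodes:
        others = sorted((u for u in nodes if u != v), key=lambda u: D[v][u])
        edges.update(frozenset((v, u)) for u in others[:k])
    return edges

def components(nodes, edges):
    """Connected components as sorted lists of vertices."""
    adj = {v: set() for v in nodes}
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        seen.add(v)
        stack, comp = [v], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        comps.append(sorted(comp))
    return comps

def mst_knn(D, nodes=None):
    """Recursive MSTkNN; passing `nodes` plays the role of submatrix(D, ...)."""
    nodes = list(range(len(D))) if nodes is None else list(nodes)
    n = len(nodes)
    if n <= 2:
        return [nodes]
    # Smallest k for which the kNN graph is connected (k = n - 1 always is).
    k_connected = next(k for k in range(1, n)
                       if len(components(nodes, knn_edges(D, nodes, k))) == 1)
    k = min(max(1, int(math.log(n))), k_connected)   # assumed integer form of Eq. (2)
    kept = mst_edges(D, nodes) & knn_edges(D, nodes, k)
    comps = components(nodes, kept)
    if len(comps) == 1:
        return [nodes]
    return [cluster for comp in comps for cluster in mst_knn(D, comp)]

# Toy example: six points on a line forming two well separated groups.
points = [0.0, 1.0, 2.5, 20.0, 21.0, 22.5]
D = [[abs(a - b) for b in points] for a in points]
print(mst_knn(D))
```

In this toy example the only MST edge that is not also a nearest-neighbor edge is the long one joining the two groups, so the intersection splits the data into two clusters and the recursion stops at the next level.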


Fig. 1 Schema of the clustering algorithm MSTkNN used for the analysis of mouse brain tissue. Starting from the expression data (probe sets measured over m samples), a complete graph G(V, E, W) is built, where the weight of the edges (W) depends on the distance/similarity metric selected; the Minimum Spanning Tree and the k-Nearest Neighbor graph are computed from it. The algorithm produces c ≥ 1 clusters, and the method is recursively applied in each of them until c = 1. The algorithm automatically decides the number of neighbors to be considered in each step

For this data set, the MSTkNN algorithm found 760 clusters of similarly expressed probe sets, with the five largest clusters consisting of 2520 (cluster 254), 2067 (cluster 190), 1581 (cluster 2), 1371 (cluster 50), and 1306 (cluster 68) probe sets. The large number of clusters found makes it hard to analyze the results. To help the analysis, we now group clusters of similarly expressed genes. We apply the clustering algorithm again, considering that each cluster is now an object and that the distance between two clusters corresponds to the average pairwise distance between all members of the two clusters. On this new data set, the MSTkNN algorithm produces 14 “superclusters” (14 clusters of clusters). The largest supercluster consists of 324 clusters (supercluster 1).
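The cluster-to-cluster distance used for the supercluster step can be sketched as follows. The function names and the toy distance matrix are assumptions made for illustration; the resulting matrix would then be passed to the same clustering routine used for the probe sets.

```python
def cluster_distance(D, cluster_a, cluster_b):
    """Average of D[i][j] over all pairs with i in cluster_a and j in cluster_b."""
    total = sum(D[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cluster_distance_matrix(D, clusters):
    """Distance matrix in which each cluster is treated as a single object."""
    m = len(clusters)
    return [[0.0 if a == b else cluster_distance(D, clusters[a], clusters[b])
             for b in range(m)] for a in range(m)]

# Hypothetical distances between four probe sets, grouped into two clusters.
D = [
    [0.0, 0.2, 1.0, 1.4],
    [0.2, 0.0, 1.2, 1.6],
    [1.0, 1.2, 0.0, 0.1],
    [1.4, 1.6, 0.1, 0.0],
]
clusters = [[0, 1], [2, 3]]
super_D = cluster_distance_matrix(D, clusters)
print(super_D)
```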

2.3 QAPgrid

A clustering algorithm will deliver one or more groups that are related according to some criteria. One of the well-known shortcomings of any clustering algorithm is that it can be too sensitive and therefore separate a group of probe sets that should be together. On the other hand, the algorithm might not separate probe sets that should be separated. In order to deal with these situations, we use a layout procedure based on the pairwise relationships of all probe sets. The relationship between any two probe sets can be measured using any similarity/distance measure, in a similar way to that of the clustering algorithm. The layout procedure of the QAPgrid is a combinatorial optimization approach that generates an ordered layout, mathematically guided to place highly related objects in nearby positions in a 2D grid. The layout problem is modeled as an instance of the Quadratic Assignment Problem (QAP), and since the QAP belongs to the NP-hard class of problems, we use an ad hoc memetic algorithm [13, 14] which we have developed to deal with the QAP instances. In the QAP, we need to assign a set of $n > 0$ facilities to $m$ locations, given as input the flows between the facilities and the distances between the locations to which these facilities would be assigned. The flows between facilities and the distances between locations are represented using matrices $F$ and $L$, respectively. The goal is to minimize the overall cost $C$ shown in Eq. (3) below, where $s(i)$ represents the assigned location of facility $i$ in a solution $s$, $l_{s(i)s(j)}$ represents the distance between the locations of facilities $i$ and $j$, and $f_{ij}$ represents the flow between facilities $i$ and $j$. A good solution for the QAP will therefore place facilities sharing a high flow in nearby locations, according to the objective function $C$ given by:

$$C = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} f_{ij}\, l_{s(i)s(j)} \qquad (3)$$
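To make the objective in Eq. (3) concrete, the short sketch below evaluates the cost of two candidate assignments. The matrices, the line-shaped set of locations, and the function name are hypothetical; the memetic algorithm that searches over assignments is not reproduced here.

```python
def qap_cost(F, L, s):
    """Eq. (3): sum over i < j of F[i][j] * L[s[i]][s[j]]."""
    n = len(F)
    return sum(F[i][j] * L[s[i]][s[j]]
               for i in range(n - 1) for j in range(i + 1, n))

# Three facilities; facilities 0 and 1 share a high flow (hypothetical values).
F = [
    [0, 5, 1],
    [5, 0, 1],
    [1, 1, 0],
]
# Four locations on a line at positions 0, 1, 2, 3, so there are more
# locations than facilities.
L = [[abs(a - b) for b in range(4)] for a in range(4)]

together = [0, 1, 3]   # facilities 0 and 1 placed in adjacent locations
apart = [0, 3, 1]      # facilities 0 and 1 placed far apart

print(qap_cost(F, L, together))  # 10
print(qap_cost(F, L, apart))     # 18
```

The assignment that keeps the high-flow pair close together has the lower cost, which is the behavior the layout procedure exploits.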

In order to create instances of the QAP for the layout problem at hand, each probe set is represented by a facility, and the flow between facilities is computed using Eq. (4). Following this, we create a square grid of $m$ locations, with $m > n$, such that the algorithm has no space restrictions to produce the layout. Additionally, we are able to consider other information, such as a graph model that highlights certain relationships between probe sets in the data set. Say we have a graph $H(V, E)$, where each vertex in $V$ represents a probe set. In this graph, there is an edge between two vertices if the two corresponding probe sets should be prioritized to be close in the final layout (i.e., apart from their similarity there might be other specific criteria). We bias the search towards co-locating these two probe sets by increasing the flow value between them by a factor of $M$, as shown in Eq. (4).
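Eq. (4) sets the flow to zero on the diagonal, to $M / d_{ij}$ for pairs joined by an edge of $H$, and to $1 / d_{ij}$ otherwise. A minimal sketch of that construction is given below; the function name, the toy distances, and the value of M are assumptions chosen only for illustration.

```python
def flow_matrix(D, edges, M):
    """Build the flow matrix F from distances D following the cases of Eq. (4).

    Assumes D[i][j] > 0 for i != j; identical profiles (zero distance)
    would need special handling.
    """
    n = len(D)
    prioritized = {frozenset(e) for e in edges}   # edges of the graph H
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                          # f_ii = 0
            boost = M if frozenset((i, j)) in prioritized else 1.0
            F[i][j] = boost / D[i][j]             # M / d_ij or 1 / d_ij
    return F

# Hypothetical distances between three probe sets.
D = [
    [0.0, 0.5, 2.0],
    [0.5, 0.0, 4.0],
    [2.0, 4.0, 0.0],
]
F = flow_matrix(D, edges=[(0, 2)], M=10.0)
print(F)
```

Here probe sets 0 and 2 are relatively distant, but the edge in H raises their flow above that of the more similar pair (0, 1), so a QAP solver would be encouraged to place them close together.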


$$f_{ij} = \begin{cases} 0 & \text{if } i = j \\ M / d_{ij} & \text{if } e_{ij} \in E \\ 1 / d_{ij} & \text{otherwise} \end{cases} \qquad (4)$$

The QAPgrid algorithm first produces the layout of each cluster independently, before producing a layout of the clusters, as shown in Fig. 2. Thus, probe sets in a cluster are organized according to their similarity, and clusters are also organized according to the similarities of the clusters’ members.

The whole map generated can be seen in Figs. 3, 4 and 5. In these figures, each element (yellow and purple) represents a cluster of probe sets, and the elements are organized according to the output of the QAPgrid. In Subheading 3, we analyze the results aided by the layout produced.

2.4 CM2 Score

Our methodology can present us with large quantities of candidate probe sets (some of the clusters generated contain more than 2500). In order to investigate its validity, we therefore select ten representative probe sets from each cluster. To select the most appropriate probe sets, we rank the data by the CM2 value (see below) and then select the top ten.

Fig. 2 Schema of the QAPgrid algorithm. For each cluster of objects, the algorithm produces a layout of the objects. Then, the algorithm produces a second level layout of each of the clusters

For each of the probe sets, we compute the difference between the expression values in the class of cells (samples) we are interested in and those in the rest of the samples. We call this measure CM2. Probe sets which have the biggest difference in expression between cell types are ranked highest; these are the most likely candidates for biomarkers. Let’s say we are looking for candidate biomarkers for samples of class 1 ($c_1$) and we refer to the rest of the samples as class 2 ($c_2$). Then, the CM2 score is the difference between the average expression values in class 1 and class 2, divided by the minimum of the ranges observed in class 1 and class 2. To compute the range of observed probe set expression values in a class, we take the maximum value observed in the class and subtract the minimum. As a consequence, the CM2 score aims at identifying probe sets which are differentially expressed (on average). It does this by scaling the difference of averages by the range of the least variable class (the class that has the minimum range value).
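The verbal definition of the CM2 score can be written down directly, as in the sketch below. The function names, probe identifiers, and expression values are hypothetical. Ranking by the signed score (so that only over-expression in class 1 ranks high) and the absence of any guard against a zero range are assumptions, since the text does not specify either point.

```python
def cm2_score(class1, class2):
    """(mean(class1) - mean(class2)) / min(range(class1), range(class2))."""
    mean1 = sum(class1) / len(class1)
    mean2 = sum(class2) / len(class2)
    range1 = max(class1) - min(class1)
    range2 = max(class2) - min(class2)
    return (mean1 - mean2) / min(range1, range2)

def top_probe_sets(expression, in_class1, top=10):
    """Rank probe sets by CM2 score, highest first, and keep the first `top`."""
    scored = []
    for probe_id, values in expression.items():
        class1 = [v for v, flag in zip(values, in_class1) if flag]
        class2 = [v for v, flag in zip(values, in_class1) if not flag]
        scored.append((cm2_score(class1, class2), probe_id))
    scored.sort(reverse=True)
    return [probe_id for _, probe_id in scored[:top]]

# Hypothetical expression values: three samples of the class of interest
# followed by three samples from the remaining cell types.
expression = {
    "probe_A": [9.0, 10.0, 11.0, 2.0, 3.0, 4.0],
    "probe_B": [5.0, 6.0, 7.0, 4.0, 6.0, 8.0],
}
in_class1 = [True, True, True, False, False, False]

print(cm2_score([9.0, 10.0, 11.0], [2.0, 3.0, 4.0]))  # 3.5
print(top_probe_sets(expression, in_class1))
```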


3 Analysis 


By comparing well described biomarker genes for each cell type (see Tables 1, 2, and 3) to the clustering results, we can identify the extent to which the algorithm has successfully clustered together probe sets with similar expression patterns. In addition, we can track these well described biomarker genes in order to find candidate biomarkers within the same cluster.

Fig. 3 (a) QAPgrid-generated map of gene clusters produced by application of the MSTkNN unsupervised clustering algorithm to a microarray dataset of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains. The location of clusters 224 and 445 is indicated. (b) Location of clusters 224 and 445 at a larger magnification, where the neighboring nodes and edges in the immediate region are displayed. (c) Location of clusters 224 and 445 (indicated by squares) at a larger magnification, where representative heat maps of neighboring clusters are also displayed. A candidate cluster for astrocyte markers is indicated by a circle

Fig. 4 (a) QAPgrid-generated map of gene clusters produced by application of the MSTkNN unsupervised clustering algorithm to a microarray dataset of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains. The location of clusters 1 and 11 is indicated. (b) Location of clusters 1 and 11 at a larger magnification, where the neighboring nodes and edges of the immediate region are displayed. (c) Location of clusters 1 and 11 (indicated by squares) at a larger magnification, where representative heat maps of neighboring clusters are also displayed. A candidate cluster for oligodendrocyte markers is indicated by a circle

Fig. 5 (a) QAPgrid-generated map of gene clusters produced by application of the MSTkNN unsupervised clustering algorithm to microarray datasets of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains. The locations of clusters 190 and 499 are indicated. (b) Location of cluster 190 at a larger magnification, where the neighboring nodes and edges in the immediate region are displayed. (c) Location of cluster 190 (indicated by a square) at a larger magnification, where representative heat maps of neighboring clusters are displayed. (d) Location of cluster 499 at a larger magnification, where the neighboring nodes and edges are displayed. (e) Location of cluster 499 (indicated by a square) at a larger magnification, where representative heat maps of neighboring clusters are displayed. Circles in (c) and (e) represent candidate clusters for neuron markers

By examining the probe sets with the highest CM2 scores within the clusters brought to our attention, we are able to show that these methods are successful in the search for biomarkers: within these clusters there are multiple genes whose expression profile is known to be cell specific (see Tables 5, 6, and 7). Just as significantly, the QAPgrid approach aids the identification of further clusters in which additional biomarkers may be found, as clusters which consist of probe sets with similar expression profiles are mapped close together (see Figs. 3, 4, and 5).

When we locate well described biomarker genes for astrocytes within the clusters and the QAPgrid, we see that their representative probe sets are restricted mainly to just two clusters, 254 and 445 (see Table 1), both of which are within super cluster 1 (see Table 1 and Fig. 3). All seven of the biomarker genes have representative probe sets localized to cluster 254 (see Table 1 and Fig. 6). Upon closer inspection of these clusters (see Fig. 6), it is evident that the constituent probe sets of both clusters share a similar pattern of expression across the cell types. In order to determine which of the 2520 (cluster 254) and 321 (cluster 445) probe sets may give the best candidates for a cell type marker, we can apply the CM2 score to rank the candidate probe sets. The top ten probe sets obtained with this approach can be seen in Table 5, where a literature review summary indicates the likelihood that we have indeed found a candidate cell marker.

Furthermore, by examining the location of these clusters within the entire QAPgrid map, we can see that some of their neighbors also contain probe sets whose expression pattern is cell specific, namely cluster 685 (see Fig. 3c).

In searching for the location of well described biomarker genes for oligodendrocytes, we can see that the majority of the probe sets are within clusters 1 and 11, both of which are within super cluster 1 (see Fig. 4 and Table 2). All but one of the ten well described biomarker genes for oligodendrocytes have representative probe sets in the same cluster, namely cluster 1 (see Table 2). Analysis of the representative heat maps for these clusters reveals an interesting pattern within cluster 11 (consisting of 25 probe sets), wherein there is an observable group of probe sets whose expression profile is highly similar (see Fig. 7). When we sort all 25 of the constituent probe sets in this cluster by their CM2 score, it is the probe sets of this observed group (1433785_at, 1433532_a_at, 1436201_x_at, 1419646_a_at, 1454651_x_at, 1456228_x_at) which have the highest CM2 ranking (see Table 6). These probe sets map to the well described biomarker genes myelin-associated oligodendrocytic basic protein and myelin basic protein, which supports the efficacy of our methods: the application of CM2 can assist with identifying likely candidates for biomarkers. Indeed, of the 20 probe sets selected by CM2 filtering, 14 have previous evidence for cell-specific expression patterns (see Table 6).

Table 6

Top ten probe sets as a result of ordering by CM2 value for clusters 1 and 11, wherein the majority of well described biomarker genes for oligodendrocytes are located, as a consequence of the application of the MSTkNN unsupervised clustering algorithm to a GEM dataset of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains

Table 7

Top ten probe sets as a result of ordering by CM2 value for clusters 561, 499, and 190, wherein the majority of well described neuronal biomarker genes are located, as a consequence of the application of the MSTkNN unsupervised clustering algorithm to a GEM dataset of pooled cell types (astrocytes, neurons, and oligodendrocytes) isolated from mouse forebrains

Fig. 6 (a) Cluster 254 and (b) Cluster 445, generated by application of the MSTkNN unsupervised clustering algorithm to a microarray dataset of pooled cell types: astrocytes (A), neurons (N), and oligodendrocytes (O) isolated from mouse forebrains. Both clusters contain representative probe sets for described astrocyte markers (see Table 1). (a) Cluster 254: Slc1a2, Gjb6, Gfap, Slc1a3, Aqp4, Fgfr3. (b) Cluster 445: Slc1a2, Slc1a3, Fgfr3

Fig. 7 (a) Cluster 1 and (b) Cluster 11, generated by application of the MSTkNN unsupervised clustering algorithm to a microarray dataset of pooled cell types: astrocytes (A), neurons (N), and oligodendrocytes (O) isolated from mouse forebrains. Both clusters contain representative probe sets for described oligodendrocyte markers (see Table 2). (a) Cluster 1: Sox10, Pdgfra, Gjc2, Mbp, Mog, Ugt8a, Gal3st1, Mag. (b) Cluster 11: Mbp, Mobp. A region of probes sharing highly similar expression is indicated with an asterisk

Looking at the location of these clusters of interest in the QAPgrid, we can identify yet another candidate cluster with highly specific gene expression profiles, namely cluster 18 (see Fig. 4c).

Although the probe sets for well described neuronal biomarker genes are not as concentrated in the same clusters as those of the other cell types under investigation (Fig. 5), when we look at the other probe sets within the clusters in which the majority of the biomarkers lie (190, 499, and 561), we see that there are still plenty of candidate cell markers there (see Table 7). This indicates the usefulness of our methods. In addition, the QAPgrid shows that there are multiple neighboring clusters depicting cell-specific expression (see Fig. 8).


4 Conclusions 


We present an integrated approach for clustering, selection, and visualization of patterns of gene expression that constitute molecular signatures of cell-specific transcription. We validate its usefulness by employing a dataset that allowed us to identify both known markers of neuronal, oligodendrocyte, and astrocytic cell types as well as others that warrant further investigation. Our study suggests the following putative novel biomarkers of cell type: for astrocytes, Pdk4, Slc15a2, Ttpa, Rfx4, Gli3, Sardh, Lonrf3, and Slc27a1; for oligodendrocytes, Adamts4, Bcl7a, and Atf6; and for neurons, Tmod3, Kcns2, Dync1i1, Mapk8ip2, Sphkap, Epb4.9, Zrsr2, Pmg211, Tmem130, and Klc1. The validation of these other putative biomarkers of cell type requires wet lab investigations. We believe these will soon follow, as some of these genes have already attracted the attention of researchers working in neurodegenerative diseases. For instance, Klc1 seems to have a role in amyloid-beta accumulation and intracellular trafficking, thus linking it to Alzheimer’s disease [15-17]. We thus expect that researchers would be motivated to determine whether all populations of neurons express Klc1 and whether, in addition, there are anatomically observable differences. Levels of KLC1 have been observed to be reduced in the frontal cortex, but not in the cerebellar cortex, of Alzheimer’s disease patients [18]. Our method allows a comprehensive analysis of all major groups of gene expression patterns across three different cell types and provides a basis for an investigation on disruption of these co-expression patterns in neurodegenerative diseases in model organisms.

Fig. 8 (a) Cluster 190 (2067 probes), (b) Cluster 499 (316 probes), and (c) Cluster 561, generated by application of the MSTkNN unsupervised clustering algorithm to a dataset of pooled cell types: astrocytes (A), neurons (N), and oligodendrocytes (O) isolated from mouse forebrains. The clusters contain representative probe sets for described neuronal markers (see Table 3). (a) Cluster 190: Nefl, Gabra1, Syt1, Snap25, Sv2b. (b) Cluster 499: Syt1, Slc12a5. (c) Cluster 561: Nefl, Syt1

Computer-Aided Breast Cancer Diagnosis with Optimal Feature Sets: Reduction Rules and Optimization Techniques

Luke Mathieson, Alexandre Mendes, John Marsden, Jeffrey Pond, and Pablo Moscato


Abstract 

This chapter introduces a new method for knowledge extraction from databases for the purpose of finding a discriminative set of features that is also a robust set for within-class classification. Our method is generic and we introduce it here in the field of breast cancer diagnosis from digital mammography data. The mathematical formalism is based on a generalization of the k-Feature Set problem called the (α, β)-k-Feature Set problem, introduced by Cotta and Moscato (J Comput Syst Sci 67(4):686-690, 2003). This method proceeds in two steps: first, an optimal (α, β)-k-feature set of minimum cardinality is identified, and then a set of classification rules using these features is obtained. We obtain the (α, β)-k-feature set in two phases: first, a series of extremely powerful reduction techniques, which do not lose the optimal solution, are employed; second, a metaheuristic search is used to identify the remaining features to be considered or disregarded. Two algorithms were tested with a public domain digital mammography dataset composed of 71 malignant and 75 benign cases. Based on the results provided by the algorithms, we obtain classification rules that employ only a subset of these features.

Key words Safe data reduction, Combinatorial optimization, Minimum feature set, Breast cancer diagnostics, Memetic algorithms


1 Introduction 


Breast cancer is one of the most common types of cancer in women all over the world, for which the most effective detection method is screening mammography analysis. Unfortunately, the method is prone to misjudgments and subjective opinion. Reported diagnostic error rates vary, ranging between 20 % and 43 % [1], and between 70 % and 89 % of all the biopsies performed following suspicious mammograms are found to be benign [2].

This chapter introduces a new approach to improve computer-aided diagnostic methods by selecting, from a given data set, a subset of the features that would allow the identification of a lesion as an intraductal carcinoma or otherwise, and the different problem of deciding whether a lesion is malignant or benign. The technique is generic and can be applied to other knowledge extraction problems on available databases [3]. The breast cancer diagnosis issue has been addressed by Kovalerchuk et al. in [4]. In our work we use the same breast cancer data set, which is composed of 149 samples and 16 features. The data is de-identified and was obtained from patients’ exams at the Woman’s Hospital of Baton Rouge, Louisiana, USA. The features include the number of calcifications, irregularities in the shape and size of the calcifications, and the density of the lesion, among others. We refer to [4] and the supplementary web page for more information on the dataset.

The goal is to find a set of features that can be used to perfectly classify all cases. At this point, we must emphasize that we are not over-fitting and that this method should not be used as a stand-alone procedure. In fact, classifiers with a high generalization ability (i.e., that can correctly classify new samples) might arise when one applies methods such as neural networks, but only using features that are relevant. This work concerns the problem of finding such features.


2 The k-Feature Set Problem 

The k-Feature Set problem undoubtedly has many applications in knowledge extraction from life sciences and medical databases. It appears as a crucial component in several areas, such as gene discovery, disease diagnosis, drug discovery or pharmacogenomics, toxicogenomics, and cancer research. It can be formalized as follows [5]:

2.1 k-Feature Set (decision version)

Input: A set X of m examples (each composed of a binary value specifying the value of the target feature and a vector of n binary values specifying the values of the other features) and an integer k > 0.

Question: Does there exist a set S of non-target features (i.e., S ⊆ {1, ..., n}) such that:

• |S| ≤ k

• No two examples in X that have identical values for all the features in S have different values for the target feature?

Clearly, this problem could be extended to an alphabet which is not binary. This is the case with the data used in this work. Although some of the features take binary values, others have a ternary or quaternary alphabet, and one takes rational values. We return to this issue later. An example can be seen in Table 1, where the target value “1” stands for “good” (for instance, absence of cancer) and “0” for “bad” (cancer).

Table 1

An instance of the k-Feature Set problem

| Example # | F1 | F2 | F3 | F4 | Target |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 | 1 |
| 3 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 |

Columns represent features; rows are samples. The last column represents the target of each sample. Features have a binary representation: “1” if they are true for that sample; “0” otherwise

Referring to the example given in Table 1, the reader will note three feature sets with cardinality k = 3: S1 = {F1, F2, F4}, S2 = {F1, F3, F4}, and S3 = {F2, F3, F4}. The question of interest is whether there is a feature set with cardinality k = 2.
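For an instance as small as Table 1, the definition can be checked by exhaustive enumeration. The sketch below is a brute-force illustration under the assumption that the table has been transcribed correctly; the function names are hypothetical, and this enumeration is not the reduction-and-metaheuristic approach developed in this chapter.

```python
from itertools import combinations

# Rows of Table 1: feature values (F1, F2, F3, F4) and the target value.
EXAMPLES = [
    ((1, 0, 0, 1), 1),
    ((0, 0, 1, 1), 1),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
    ((0, 0, 1, 0), 0),
]

def is_feature_set(examples, S):
    """True if no two examples agree on every feature in S but differ in target."""
    seen = {}
    for features, target in examples:
        key = tuple(features[i] for i in S)
        # setdefault stores the first target seen for this pattern.
        if seen.setdefault(key, target) != target:
            return False
    return True

def feature_sets_of_size(examples, k):
    """All feature sets of exactly k features, as tuples of 0-based indices."""
    n = len(examples[0][0])
    return [S for S in combinations(range(n), k) if is_feature_set(examples, S)]

for k in range(1, 5):
    named = [tuple("F%d" % (i + 1) for i in S)
             for S in feature_sets_of_size(EXAMPLES, k)]
    print(k, named)
```

On the table as transcribed above, the enumeration reports the three sets of cardinality 3 listed in the text, a single set of cardinality 2 ({F2, F4}), and no set of cardinality 1. The number of candidate subsets grows exponentially with the number of features, which is why this direct approach is only viable for very small instances.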

Computer scientists are always interested in determining the computational complexity of the basic problem. This decision problem has already been shown to be NP-complete by a reduction from Vertex Cover [5], another problem belonging to this class. NP-completeness is generally viewed as a reflection of the “computational intractability” of a given mathematical problem. What this typically means, in practical terms, is that it is highly improbable that we will be able to find an algorithm that solves this problem efficiently for general instances. Efficiently, in computational terms, means in a time which is bounded by a polynomial function of the size of the input.

This does not mean that the problem cannot be addressed by other means. For small instances or for small values of k, it may still be possible to solve it with exact algorithms. Although they may have exponential worst-case behavior, such algorithms may provide the optimal solution in reasonable amounts of time. On the other hand, for large instances, there are now powerful mathematical methods, generically named metaheuristics, which allow these problems to be solved in practice. Some of these methods are based on the evolution of alternative feasible solutions for the optimization versions of the problem. Examples of this type are Genetic [6] and Memetic Algorithms [7]. An important research question, particularly for the analysis of large instances of this problem in the field of Functional Genetics, is whether there is a fixed-parameter algorithm for this problem. In 2003, Cotta and Moscato proved that the problem is not only NP-complete but W[2]-complete [8]. This means it is unlikely that a fixed-parameter tractable algorithm exists for this problem.

The W[2]-completeness of k-Feature Set shows that attention should be concentrated on finding efficient rules that do not remove the optimal solution, but can help to reduce the size of the instance. In the parameterized complexity context, this is called reduction to a problem kernel. This will allow a search algorithm to increase the chances of finding the optimal solution and may enable exponential-time exact searches for smaller instances.

We now discuss the characteristics of the application we are using to introduce this methodology. We then discuss the generalization of k-Feature Set and how the problem can be formalized as an optimization problem in a particular type of graph.


3 The Breast Cancer Data Set—Initial Preprocessing 

As mentioned before, the breast cancer data set used in this work is 
the same as used by Kovalerchuk et al. [9]. What is available in the 
public domain is a small subset of a larger database related to tumor 
exams from patients at the Woman’s Hospital of Baton Rouge, 
Louisiana, USA. We refer the reader to the authors’ webpage. 1 
The data set contains 149 examples, and each example has 17 features
(corresponding to the attributes numbered from 2 to 18 inclusive).

With this data, we conducted two independent tests, each with
a binary target. The first was to predict an intraductal carcinoma,
and the second to identify whether the tumor is malignant or
benign. From the 17 features, we only consider the 16 which are
represented by a finite alphabet for our study. This departs from [9]
and leaves out from our data set the feature #2: "Approximate
volume of the lesion in cubic centimeters". This does not mean
that we consider the volume of the lesion "irrelevant" at all. When a
given feature can assume any integer value (or any value in an
infinite alphabet), a common approach is to determine appropriate
thresholds for it. These problems are generally known as optimal
thresholding problems and in some cases they lead to other NP-hard
problems. In particular, the decision problem for the thresholding
for k-Feature Set is NP-complete [3].

1 http://www.csc.lsu.edu/trianta/ResearchAreas/DigitalMammography/index.html.

Considering the 149 examples present in the dataset, there are
two samples (#102 and #134, following the original dataset
numbering) that have the same values for all features, except for the
volume of calcifications, but have differing target values. Even for
the volume of calcifications, the difference is minimal (0.048
against 0.072) considering that the observed values range between
0.008 and 218.416. Since the k-Feature Set problem does not
allow different outcomes for the same feature pattern, we decided
to remove both examples.
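The consistency check described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the record layout (id, feature tuple, target) and the function name find_conflicts are hypothetical, and the toy records are not taken from the mammography data.

```python
from collections import defaultdict

def find_conflicts(examples):
    """Return groups of example ids that share a feature pattern
    but carry different target values."""
    by_pattern = defaultdict(list)
    for ex_id, features, target in examples:
        by_pattern[tuple(features)].append((ex_id, target))
    conflicts = []
    for group in by_pattern.values():
        # More than one distinct target for the same pattern is a conflict.
        if len({target for _, target in group}) > 1:
            conflicts.append(sorted(ex_id for ex_id, _ in group))
    return conflicts

# Hypothetical toy records: (id, discretized features, target)
toy = [
    (1, ("A", "B"), "Y"),
    (2, ("A", "B"), "N"),
    (3, ("C", "B"), "N"),
]
conflicting = find_conflicts(toy)
```

Every group returned this way would have to be removed (or otherwise resolved) before the feature set problem is posed.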

Another inconsistency found was with sample #140. The attribute
"ductal orientation" can only take yes/no values
(labeled "A" and "B" respectively), but in example #140 there is an
invalid entry "D." Since it was not possible to infer the correct
value, we ignored the sample. This leaves us with 146 samples with
16 features. The sixteen features x1, ..., x16 we consider are the
following:

• x1: number of calcifications per cm² (A: <10; B: 10 to 20; C: >20)
• x2: total number of calcifications (A: <10; B: 10 to 30; C: >30)
• x3: irregularity in shape of calcifications (A: mild; B: moderate;
  C: marked)
• x4: variation in the shape of calcif. (A: mild; B: moderate;
  C: marked)
• x5: irregularity in the size of calcif. (A: mild; B: moderate;
  C: marked)
• x6: variation in the density of calcif. (A: mild; B: moderate;
  C: marked)
• x7-x11: Le Gal type of lesion (a given lesion may contain several
  types)
• x12: ductal orientation (A: yes; B: no)
• x13: density of the calcifications (A: low; B: moderate; C: high)
• x14: density of the parenchyma (A: low; B: moderate; C: high)
• x15: comparison with previous exam (A: change in the number
  or character of calcifications; B: not defined; C: newly developed;
  D: no previous exam)
• x16: associated findings (A: multifocality; B: architectural
  distortion; C: mass; D: none)

The thresholds are the same as those used in Kovalerchuk et al. [4].
The target feature separates the examples into two main groups.
In the first test, each example is classified as A: intraductal
carcinoma (37 examples) or B: everything else, i.e., other cases
(109 examples). In the second test we classify the examples as
A: malignant (71 examples) or B: benign (75 examples).

4 The (α, β)-k-Feature Set Problem

In this combinatorial optimization problem the practical objective
is to find the minimum set of features that can differentiate every
pair of examples belonging to different classes (see Note 1). In
addition, this feature set must also explain every pair of examples
that belong to the same class. The greater the parameters α and β,
the greater the reliability of the classification system, at the expense
of a larger optimal feature set. Formally, the decision version of the
(α, β)-k-feature set problem we are addressing in this chapter is the
following:

4.1 (α, β)-k-Feature Set (Decision Version)

Input: A set of $m$ examples $X = \{x^{(1)}, \ldots, x^{(m)}\}$, such
that for all $i$,
$x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}, t^{(i)}\} \in \{0,1\}^{n+1}$,
and three integers $k > 0$ and $\alpha, \beta \geq 0$.

Question: Does there exist an $(\alpha, \beta)$-$k$-feature set $S$,
i.e., $S \subseteq \{1, \ldots, n\}$ with $|S| \leq k$, such that:

• for all pairs of examples $(x^{(i)}, x^{(j)})$, $i \neq j$, if
  $t^{(i)} \neq t^{(j)}$ there exists $S' \subseteq S$ such that
  $|S'| \geq \alpha$ and for all $l \in S'$, $x_l^{(i)} \neq x_l^{(j)}$, and

• for all pairs of examples $(x^{(i)}, x^{(j)})$, $i \neq j$, if
  $t^{(i)} = t^{(j)}$ there exists $S' \subseteq S$ such that
  $|S'| \geq \beta$ and for all $l \in S'$, $x_l^{(i)} = x_l^{(j)}$?

In the definition above the set S' is not fixed for all pairs of
examples, but it is a function of the pair of examples chosen, so we
mean S' = S'(i, j). The basic idea is to improve robustness of the
original method by allowing some redundancy in example
discrimination. We seek to have at least α features for differentiating
between any two samples of different classes. Similarly, we want to
have at least β features with consistent values for any two samples of
the same class. Note that each feature may have its own distinct
alphabet.
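As a minimal sketch of the definition, the following Python function checks whether a candidate set S is an (α, β)-k-feature set. The function name, the 0-based feature indices, and the toy data are our own illustrative assumptions; the check simply counts, for every pair of examples, how many features of S differ or agree.

```python
from itertools import combinations

def is_feature_set(examples, targets, S, alpha, beta, k):
    """Check the (alpha, beta)-k-feature set conditions for S.

    examples: list of feature tuples; targets: list of class labels;
    S: set of 0-based feature indices (an assumption of this sketch).
    """
    if len(S) > k:
        return False
    for i, j in combinations(range(len(examples)), 2):
        differ = sum(1 for l in S if examples[i][l] != examples[j][l])
        agree = len(S) - differ
        if targets[i] != targets[j]:
            # Different classes: at least alpha discriminating features.
            if differ < alpha:
                return False
        elif agree < beta:
            # Same class: at least beta features with equal values.
            return False
    return True

# Hypothetical toy instance with three binary features.
examples = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
targets = [0, 0, 1, 1]
ok_02 = is_feature_set(examples, targets, {0, 2}, alpha=1, beta=1, k=2)
ok_1 = is_feature_set(examples, targets, {1}, alpha=1, beta=1, k=2)
```

In the toy instance, features 0 and 2 together satisfy both conditions, whereas feature 1 alone cannot separate the first and third examples.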

Clearly this problem is also NP-complete (the k-feature set
problem is a special case with α = 1 and β = 0). We also note
that this naturally leads to a multiobjective optimization problem
in which, for a given input data, we try to maximize the values of
α, β ≥ 0 and at the same time minimize the value of k > 0.


5 The (α, β)-k-Feature Set as an Optimization Problem in Graphs

It is very useful to reformulate the (α, β)-k-feature set as an
optimization problem in a graph. This is beneficial as it allows the
use of powerful reduction techniques that significantly reduce the
computational effort. Next, we define how to create a bipartite graph
G(V, E) from an instance of the problem.

Initially, partition the set of nodes of the graph into three
disjoint sets A, B and F such that A ∪ B ∪ F = V. Since we have said
that the graph is bipartite, this may seem confusing at first glance;
note that one side of the bipartition of G is the set F and the other
is A ∪ B.

We will now proceed to define how we build the graph, starting
with the set of vertices. Each node in the first subset of vertices (A)
represents a unique pair of examples which have different target
values. Analogously, we define B as the subset of nodes of G such
that each node corresponds to a pair of examples which have the
same target value. Each node in the remaining subset F represents a
different feature.

We now define the edge set E. There are only two types of
edges, those that connect nodes in F with nodes in A and those that
connect nodes in F with nodes in B. There is an edge between a
node f ∈ F and a node a ∈ A if and only if the pair of examples that
node a represents have different values for feature f.

Analogously, there is an edge between a node f ∈ F and a node
b ∈ B if and only if the pair of examples that node b represents
have the same value for feature f. The graph appearance is shown in
Fig. 1, labeled as "Original Graph".
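The construction just described can be sketched as follows. This is an illustrative Python version, not the implementation used in the study: the function name build_graph, the adjacency-dictionary representation, and the toy data are assumptions made here.

```python
from itertools import combinations

def build_graph(examples, targets):
    """Build the bipartite graph as (A, B, adj).

    A: pairs of examples with different targets,
    B: pairs of examples with the same target,
    adj: maps each pair node to the set of feature nodes adjacent to it.
    """
    n_features = len(examples[0])
    A, B, adj = [], [], {}
    for i, j in combinations(range(len(examples)), 2):
        pair = (i, j)
        if targets[i] != targets[j]:
            A.append(pair)
            # Edge to every feature on which the two examples differ.
            adj[pair] = {f for f in range(n_features)
                         if examples[i][f] != examples[j][f]}
        else:
            B.append(pair)
            # Edge to every feature on which the two examples agree.
            adj[pair] = {f for f in range(n_features)
                         if examples[i][f] == examples[j][f]}
    return A, B, adj

# Hypothetical toy instance with three features.
examples = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
targets = [0, 0, 1, 1]
A, B, adj = build_graph(examples, targets)
n_edges = sum(len(fs) for fs in adj.values())
```

Because every pair of examples becomes a node, the number of pair nodes grows quadratically with the number of examples, which is why the reduction rules discussed later matter in practice.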




Fig. 1 Graph representation of the (α, β)-k-Feature Set problem. The top-left diagram shows the graph
constructed from the example in Table 1. The other three show the action of the reduction rules. Rule #1
searches for pairs of examples that are explained by a single feature. Rule #2 looks for features that explain
pairs of examples already covered by another feature. Rule #3 searches for pairs of examples that are
irrelevant from the graph domination point of view

For the data we are considering, direct application of this graph
transformation leads to a large graph, with 10,731 nodes and 
87,341 edges. We show how some powerful reduction techniques 
can then be applied. They will help by selecting “relevant” features 
(features that must be in the optimal solution), irrelevant features 
(features that contribute nothing to the optimal solution) and by 
removing nodes from A and B that do not provide any extra 
information. These reductions are safe as no information is lost in 
the process, thus even though we may have a drastic reduction in 
the size of the graph, we can still find an optimal solution in the 
graph. 


6 Reduction Techniques 


The application of the reduction rules for the generic problem of
knowledge discovery we address here is inspired by the influential
work of Weihe [10]. He showed how they can be very useful for
solving large, real-world optimization problems. As the graph built
from the instance of the problem under study contains nodes
which have degree one (in both A and B), at most one feature
explains the differences/similarities of the pairs of examples
represented by these nodes, and thus this data set allows
at most α = β = 1. For this reason we explain the reduction
rules specifically for this condition. If α and β could assume larger
values, the rules would be slightly different, although the principle
remains the same. Illustrations of the following rules can be seen in
Fig. 1.


6.1 Reduction Rule 1: Relevant and Mandatory Features

This rule searches for features that must be in any minimal
cardinality (1, 1)-k-feature set. If there is a node in either A or B
that has degree one, the feature at the other endpoint of the
connecting edge must be in the feature set.
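A sketch of Rule 1 for the case α = β = 1 is given below. The dictionary adj (pair node to set of adjacent features) and the function names are hypothetical. Discarding the pairs already explained by a mandatory feature is an additional step assumed here; it is safe only because a single explaining feature suffices when α = β = 1.

```python
def mandatory_features(adj):
    """Features adjacent to a pair node of degree one (Rule 1)."""
    return {next(iter(fs)) for fs in adj.values() if len(fs) == 1}

def apply_rule1(adj):
    """Return the mandatory features and the pairs still unexplained.

    Assumption of this sketch: with alpha = beta = 1, any pair adjacent
    to a mandatory feature is already explained and can be dropped.
    """
    kernel = mandatory_features(adj)
    reduced = {pair: fs for pair, fs in adj.items() if not (fs & kernel)}
    return kernel, reduced

# Hypothetical toy graph: pair node -> adjacent features.
adj = {"a1": {0}, "a2": {0, 1}, "b1": {1, 2}}
kernel, reduced = apply_rule1(adj)
```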


6.2 Reduction Rule 2: Irrelevant Features

This rule helps to identify features that can be considered irrelevant.
A feature can be deemed irrelevant if it can only explain pairs of
examples which are a subset of other pairs already explained by
another feature. When we say "explain" we refer to the fact that
the presence of a feature can account for the difference or the
similarity between a pair of examples. Let $N_{A1}$ and $N_{A2}$ be
subsets of pairs of examples belonging to A, such that
$N_{A1} \subseteq N_{A2}$. Then, consider that every element in
$N_{A1}$ is connected to feature $f_i$ and that all elements in
$N_{A2}$ are connected to a different feature $f_j$. Moreover, let
$N_{B1}$ and $N_{B2}$ be subsets of pairs of examples belonging to B,
such that $N_{B1} \subseteq N_{B2}$. Then, consider that every element
in $N_{B1}$ is connected to $f_i$ and that all elements in $N_{B2}$
are connected to $f_j$. Under such conditions, feature $f_i$ is
irrelevant and will not be present in the optimal feature set. If two
or more features explain exactly the same pairs of examples, they are
condensed into a single new feature. Then, if such a new feature is
selected as optimal, the researcher can analyze its components
individually and decide which to use.
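Rule 2 can be sketched in Python as below. The representation (a dictionary from each feature to the set of pair nodes it explains) and the function name are assumptions of this illustration. Since A and B are disjoint, comparing the full sets of explained pairs is equivalent to requiring containment on the A side and on the B side separately.

```python
def remove_dominated_features(cover):
    """Apply Rule 2 to a map feature -> set of explained pair nodes.

    Features explaining exactly the same pairs are condensed into one
    entry (keyed by a tuple of the original features); an entry is
    dropped when its pairs are a proper subset of another entry's pairs.
    """
    groups = {}
    for feature, pairs in cover.items():
        groups.setdefault(frozenset(pairs), []).append(feature)
    merged = {tuple(sorted(fs)): pairs for pairs, fs in groups.items()}
    kept = {}
    for name, pairs in merged.items():
        # '<' on frozensets tests for a proper subset.
        if not any(pairs < other for other in merged.values()):
            kept[name] = pairs
    return kept

# Hypothetical toy instance: feature -> pairs it explains.
cover = {
    0: {"a1", "b1"},
    1: {"a1", "a2", "b1"},
    2: {"a1", "a2", "b1"},
    3: {"b2"},
}
kept = remove_dominated_features(cover)
```

In the toy instance, features 1 and 2 are condensed into one entry, and feature 0 is dropped because everything it explains is also explained by that entry.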


6.3 Reduction Rule 3: Redundant Pairs of Examples

The third reduction rule is motivated by the aim to efficiently find
pairs of examples whose differences (or similarities) are already
accounted for when dealing with another pair of examples. Let
$F_1 \subseteq F_2 \subseteq F$, where $v_1 \in A \cup B$ is connected
to all of $F_1$ and $v_2 \in A \cup B$ is connected to all of $F_2$.
In this case, a suitable set of features in $F_1$ that dominates
$v_1$ will automatically dominate $v_2$ too. Thus we can delete
$v_2$, as it provides no extra information about the features
required for the feature set. Here note that we make no
distinction between example vertices from A and B. This
generalization only holds when $\alpha = \beta$. If
$\alpha \neq \beta$ we must use the more generic rule presented in [3].

6.4 Recursion

For a more general version of these rules, which considers any α and
β values, please refer to Cotta, Sloper and Moscato [3]. The
reduction rules are applied repeatedly to the original graph in
sequence, starting with rule 1, until none of the three yields any
further reduction.
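As a sketch of Reduction Rule 3 for the case α = β = 1, the Python function below removes every pair node whose feature neighborhood contains the neighborhood of a pair node that is kept. The data layout and names are hypothetical, and the sketch assumes every pair node has at least one adjacent feature.

```python
def remove_redundant_pairs(adj):
    """Apply Rule 3 to a map pair node -> set of adjacent features.

    Nodes are visited by increasing neighborhood size, so the node with
    the smallest neighborhood in any containment chain is the one kept.
    Assumes alpha == beta == 1 and non-empty neighborhoods.
    """
    order = sorted(adj, key=lambda v: (len(adj[v]), str(v)))
    kept = {}
    for v in order:
        fs = frozenset(adj[v])
        # v is redundant if a kept node's features are all among v's.
        if any(kept_fs <= fs for kept_fs in kept.values()):
            continue
        kept[v] = fs
    return kept

# Hypothetical toy graph: pair node -> adjacent features.
adj = {"p1": {0}, "p2": {0, 1}, "p3": {1, 2}, "p4": {1, 2}}
kept_pairs = remove_redundant_pairs(adj)
```

In a full reduction procedure this function would be called in a loop together with the other two rules until the graph stops shrinking.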


7 Memetic Algorithm


As mentioned before, the decision version of the (α, β)-k-feature set
problem is NP-complete. Therefore, complete enumeration or
exact solution search methods [11] can only be used when the
graph is rather small, since the computational complexity grows
exponentially with the number of features and samples. Most times,
even after the preprocessing step, the graph still remains too large
for exact methods to be used in practice.

In such cases, it would be wise to resort to stochastic or
informed search algorithms, or more powerful metaheuristics, so
as to provide high quality solutions. To illustrate this,
we have implemented a population-based metaheuristic, a memetic
algorithm [7, 12, 13], and we give some details of its
implementation. We focus on the main aspects, which are important
for the Feature Set problem itself.

7.1 Representation and Recombination

The representation chosen for the Feature Set problem assigns each
feature to a position in a binary array. The positions can assume
true/false values that will indicate whether the corresponding
feature is in the feature set (effective) or not (ineffective). In a
graph context, if a feature becomes ineffective, the corresponding
node and all edges connected to it are erased.

Recombination creates new solutions by combining information
taken from two original solutions. In this work, the recombination
has both deterministic and stochastic aspects. The first phase
is deterministic, where features which are effective in both original
solutions are set as effective in the new one. The rest of the new
solution is completed by randomly choosing an ineffective feature
and making it effective, until the solution becomes feasible.
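The two phases of the recombination can be sketched as follows. The chapter's implementation was written in Java; this Python version, its function names, and the toy feasibility test (every pair node must keep at least one effective feature, i.e., α = β = 1) are illustrative assumptions.

```python
import random

def recombine(parent_a, parent_b, is_feasible, rng):
    """Create a child solution from two boolean feature arrays."""
    # Deterministic phase: keep features effective in both parents.
    child = [a and b for a, b in zip(parent_a, parent_b)]
    # Stochastic phase: switch on random ineffective features
    # until the child becomes feasible.
    candidates = [i for i, bit in enumerate(child) if not bit]
    rng.shuffle(candidates)
    while candidates and not is_feasible(child):
        child[candidates.pop()] = True
    return child

# Hypothetical toy graph: pair node -> adjacent features.
pair_to_features = {"p1": {0, 1}, "p2": {2}, "p3": {1, 3}}

def covers_all(solution):
    """Feasible if every pair node has at least one effective feature."""
    return all(any(solution[f] for f in fs)
               for fs in pair_to_features.values())

child = recombine([True, False, True, False],
                  [False, True, True, False],
                  covers_all, random.Random(0))
```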

7.2 Local Search

Local Search (LS) is applied to all new solutions created through
recombination, as is usual in many memetic algorithms. The goal is
to improve the solution by testing a series of changes, related to a
neighborhood definition, in the solution and keeping the changes
that actually improve the solution's quality with regard to the
objective function. Since only feasible solutions are generated by
the recombination procedure, the only avenue for improvement is
to reduce the number of effective features in the solution. In order
to do so, three neighborhoods were tested.

The first neighborhood sequentially selects every effective
feature and makes it ineffective. If the resulting solution is still
feasible, then the feature set size has been reduced by one. If the
solution has become infeasible, then the feature returns to its
original state.

The second neighborhood tries to reduce the number of
features by removing two features from the feature set and adding only
one new feature. It initially selects two effective features, making
them ineffective. Then it selects an ineffective feature, different
from the two removed, and makes it effective. If the solution
remains feasible, the feature set size has been reduced; otherwise all
features return to their original states.

The last neighborhood is an extension of the second one.
Instead of extracting two effective features, it extracts three and
adds two, different from those removed. The dimension of this
third neighborhood is very large and it could only be applied
because the resulting graph after the reduction rules was very
small. Depending on the instance size, only the two smaller
neighborhoods might be used. The use of local search techniques for
such a small instance is not necessary at all. However, when dealing
with larger instances, where even after the use of the reduction rules
the search space is still considerably large, its use will become
imperative.

The three local searches are applied to the solution in a
sequential way, until no further improvement can be obtained. When
this happens, we conclude the solution has reached a local minimum for
all neighborhoods and stop the process.
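A compact sketch of the first two neighborhoods and of their sequential application is shown below (the third neighborhood follows the same pattern with three removals and two additions). Function names and the toy instance are assumptions of this illustration, not the original Java code.

```python
from itertools import combinations

def drop_one(sol, is_feasible):
    """Neighborhood 1: try to switch off each effective feature."""
    improved = False
    for i in range(len(sol)):
        if sol[i]:
            sol[i] = False
            if is_feasible(sol):
                improved = True
            else:
                sol[i] = True  # restore the original state
    return improved

def two_for_one(sol, is_feasible):
    """Neighborhood 2: remove two effective features, add one other."""
    on = [i for i, bit in enumerate(sol) if bit]
    off = [i for i, bit in enumerate(sol) if not bit]
    for i, j in combinations(on, 2):
        for k in off:
            sol[i] = sol[j] = False
            sol[k] = True
            if is_feasible(sol):
                return True
            sol[i] = sol[j] = True  # restore the original state
            sol[k] = False
    return False

def local_search(solution, is_feasible):
    """Apply the neighborhoods in sequence until neither improves."""
    sol = list(solution)
    while drop_one(sol, is_feasible) or two_for_one(sol, is_feasible):
        pass
    return sol

# Hypothetical toy graph: pair node -> adjacent features.
pair_to_features = {"p1": {0, 2}, "p2": {1, 2}, "p3": {3}}

def covers_all(solution):
    return all(any(solution[f] for f in fs)
               for fs in pair_to_features.values())

result = local_search([True, True, False, True], covers_all)
```

In the toy instance no single feature can be dropped, but exchanging features 0 and 1 for feature 2 yields a smaller feasible set.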

8 Computational Results

The algorithms described above were implemented using the Java
Programming Language (JDK 1.7.0) on a Pentium 4 HT PC
running at 3.0 GHz, with 1 GB RAM. The CPU time required
by the reduction techniques was quite considerable, just over
78 min, but still drastically smaller than many current approaches
for solving NP-hard problems. As this was a pilot study on the
power of this tandem approach, no systematic effort was made to
optimize implementation details that could have sped up the
reduction rules. Although the CPU time was significant,
application of the reduction rules was very beneficial: solving the
reduced graph became an almost trivial task for the memetic algorithm.

The three reduction techniques give very good results for both
instances under consideration. In Table 2 we present figures that
show the magnitude of the reduction obtained. The decrease in the
size of the instances is impressive, both in terms of nodes and edges.
Again, we must emphasize that the reduction is not a heuristic, and
it is a safe procedure as a minimum cardinality (α, β)-k-feature set is
still obtainable from the output of the reductions and the optimal
solution of the reduced graph. Concerning the features, for
intraductal carcinoma vs. other cases, the reduction rules found that
x7, x8, x9, and x15 must be in the feature set, and feature x11 should
be discarded. It is interesting to remark that x11 (indicating one of
the Le Gal types of the lesion) was not relevant (at least if we
include the other three Le Gal type features x7, x8, and x9) and the
presence of x15 reinforces the relevance of including the comparison
with previous exam as an aid to the diagnosis. For malignant vs.
benign, the reduction rules found that x2, x7, x8, and x14 must be in
the feature set, but none of the remaining features were ruled out.

Table 2
Results for the breast cancer-related graph reduction

Instance                                # of nodes   # of edges   # of features
Intraductal carcinoma test (original)   10,731       87,341       16
Intraductal carcinoma test (reduced)    32           93           11
Reduction                               99.7 %       99.9 %       31.2 %
Malignancy test (original)              10,585       88,460       16
Malignancy test (reduced)               31           91           12
Reduction                               99.7 %       99.9 %       25.0 %

Notice the extreme reduction in the graph's size: always more than 99 % for the number of nodes and edges. The
reduction in the number of features was also significant

Table 3
Feature set results for α = β = 1 for the memetic algorithm

Memetic algorithm test            Feature set elements (# of features)
Intraductal carcinoma vs. other   x2, x5, x6, x7*, x8*, x9*, x12, x13, x15*, x16 (10)
Malignant vs. benign              x2*, x3, x5, x7*, x8*, x12, x13, x14*, x15 (9)

Features marked with an asterisk are those that have been already identified by the reduction rules alone. From the 11
features after the reduction procedure (see Table 2), ten are needed to perfectly classify intraductal carcinomas.
Similarly, nine features out of the 12 remaining after the reduction are needed to classify malignancy

On the reduced graph, the memetic algorithm found an
(α = 1, β = 1)-k-feature set with k = 6 features for the intraductal
carcinoma case. After testing all the possible solutions with five or
fewer features (a brute force procedure could be used because of the
instance size), none of the 462 sets was found feasible. That means
the minimum size of a (1, 1)-k-feature set that explains the data set
is ten: four from the kernel plus six from the memetic algorithm.

Regarding the malignant case, this time the memetic algorithm
found a feature set of cardinality five. That translates into nine
features to explain the data set: five from the memetic algorithm
and four from the kernel. This was also confirmed to be the
minimum cardinality feature set possible. The complete enumeration by
brute force took just over 10 s while both the MA and the Greedy
procedure took a fraction of a second. In Table 3 we present the
number of features obtained by the two algorithms.
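A brute-force confirmation of this kind can be sketched as an enumeration of candidate feature sets by increasing size. The function below and its toy instance are hypothetical illustrations. As a side remark, 462 equals the number of ways of choosing 5 items out of 11; reading the figure that way is our assumption, not a statement from the study.

```python
from itertools import combinations
from math import comb

def smallest_feasible_set(features, is_feasible):
    """Return a minimum-cardinality feasible subset, or None."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            if is_feasible(set(subset)):
                return set(subset)
    return None

# Hypothetical toy graph: pair node -> adjacent features.
pair_to_features = {"p1": {0, 1}, "p2": {1, 2}, "p3": {3}}

def covers_all(selected):
    """Feasible (alpha = beta = 1) if every pair keeps a selected feature."""
    return all(fs & selected for fs in pair_to_features.values())

best = smallest_feasible_set([0, 1, 2, 3], covers_all)
n_five_of_eleven = comb(11, 5)
```

The enumeration is exponential in the number of features, so it is only practical on instances as small as the reduced graphs of Table 2.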


9 Classification Rules 

After the determination of the relevant features, the determination 
of classification rules comes naturally. Considering the intraductal 
carcinoma case, Table 4 presents the list of rules that classify the 
data based on the feature set found by the memetic procedure. 
Table 5 presents the rules associated with the malignant vs. benign 
cases, attained from the results of the memetic algorithm. 

Table 4
Classification rules for the memetic algorithm: intraductal carcinoma vs. other

Rule  Rule's clauses (# of samples covered - intraductal carcinoma? [Y/N])
1     ¬(x5 = C) ∧ ¬(x6 = C) ∧ (x7 = 0) ∧ (x8 = 2) ∧ (x9 = 0) ∧ (x12 = B) ∧ ¬(x15 = B) ∧
      ¬(x16 = B) (24 - N)
2     ¬(x2 = C) ∧ ¬(x6 = C) ∧ ¬(x15 = B) ∧ (x16 = B) (9 - N)
3     ¬(x6 = C) ∧ (x7 = 1) ∧ (x9 = 0) ∧ ¬(x15 = B) (8 - N)
4     ¬(x2 = C) ∧ ¬(x5 = B) ∧ ¬(x6 = C) ∧ (x7 = 0) ∧ (x8 = 0) ∧ (x12 = B) ∧ ¬(x15 = B) ∧
      ¬(x15 = C) ∧ (x16 = D) (14 - N)
5     ¬(x5 = A) ∧ ¬(x6 = C) ∧ ¬(x13 = B) ∧ ¬(x15 = A) ∧ ¬(x15 = B) (9 - N)
6     (x6 = C) ∧ (x15 = C) (6 - Y)
7     ¬(x6 = B) ∧ (x7 = 1) ∧ ¬(x15 = B) (4 - N)
8     ¬(x5 = B) ∧ (x7 = 0) ∧ ¬(x15 = B) ∧ (x16 = B) (3 - Y)
9     (x2 = B) ∧ ¬(x5 = A) ∧ (x7 = 0) ∧ (x8 = 0) ∧ (x9 = 0) ∧ ¬(x13 = A) ∧
      ¬(x15 = B) ∧ (x16 = C) (5 - N)
10    ¬(x6 = B) ∧ (x7 = 0) ∧ (x8 = 0) ∧ ¬(x13 = C) ∧ (x15 = C) ∧ ¬(x16 = B) (4 - Y)
11    ¬(x2 = C) ∧ (x5 = B) ∧ ¬(x6 = B) ∧ (x7 = 0) ∧ ¬(x15 = B) ∧ ¬(x15 = C) ∧ ¬(x16 = B)
      (8 - N)
12    ¬(x2 = C) ∧ (x7 = 0) ∧ (x8 = 2) ∧ ¬(x15 = B) ∧ ¬(x15 = C) ∧ ¬(x16 = B) (5 - Y)
13    (x2 = B) ∧ ¬(x6 = A) ∧ (x7 = 0) ∧ ¬(x13 = C) ∧ ¬(x15 = B) ∧ ¬(x15 = C) (6 - Y)
14    ¬(x2 = A) ∧ (x6 = A) ∧ (x7 = 0) ∧ (x13 = A) ∧ ¬(x15 = B) ∧ ¬(x16 = C) (8 - N)
15    (x7 = 0) ∧ (x15 = C) (3 - N)
16    (x5 = A) ∧ (x7 = 0) ∧ ¬(x13 = A) ∧ ¬(x15 = A) ∧ ¬(x15 = B) ∧ ¬(x16 = B) (2 - Y)
17    ¬(x2 = C) ∧ (x7 = 0) ∧ ¬(x15 = B) (4 - N)
18    ¬(x5 = A) ∧ ¬(x6 = B) ∧ (x7 = 0) ∧ (x8 = 2) ∧ ¬(x13 = A) ∧ ¬(x16 = B) (3 - N)
19    (x13 = A) ∧ ¬(x16 = B) (2 - Y)
20    ¬(x5 = A) ∧ ¬(x6 = B) ∧ (x9 = 3) ∧ ¬(x13 = A) (2 - N)
21    ¬(x5 = B) ∧ (x8 = 0) ∧ (x9 = 0) ∧ (x12 = A) ∧ (x13 = B) (3 - N)
22    (x12 = A) ∧ ¬(x13 = A) ∧ (x15 = A) ∧ ¬(x16 = C) (4 - Y)
23    (x8 = 0) ∧ (x15 = A) ∧ ¬(x16 = A) (4 - N)
24    ¬(x13 = A) (5 - Y)
25    All the rest (1 - N)

There are 25 rules in total, which should be read in a cascading way. That is, rule #1 classifies 24 samples as
non-intraductal carcinoma. For the remaining samples, rule #2 classifies 9 of them as non-intraductal carcinoma
again, and so on

Table 5
Classification rules for the memetic algorithm: Malignant vs. Benign

Rule  Rule's clauses (# of samples covered - Malignant? [Y/N])
1     ¬(x3 = B) ∧ (x7 = 1) (10 - N)
2     ¬(x3 = A) ∧ (x12 = A) ∧ (x13 = B) (15 - Y)
3     (x2 = C) ∧ ¬(x3 = A) ∧ (x5 = C) ∧ (x12 = B) ∧ ¬(x13 = A) (6 - Y)
4     (x3 = B) ∧ (x15 = C) (4 - Y)
5     ¬(x5 = C) ∧ (x8 = 2) ∧ (x12 = B) ∧ (x13 = C) ∧ ¬(x15 = B) (8 - N)
6     (x2 = B) ∧ ¬(x3 = B) ∧ ¬(x5 = C) ∧ (x7 = 0) ∧ (x8 = 2) ∧ (x12 = B) ∧ ¬(x15 = B) ∧
      ¬(x15 = C) (5 - N)
7     (x2 = A) ∧ ¬(x3 = A) ∧ ¬(x15 = B) (13 - N)
8     ¬(x2 = C) ∧ ¬(x5 = A) ∧ (x7 = 0) ∧ (x12 = B) ∧ (x13 = A) ∧ (x15 = A) (6 - Y)
9     ¬(x2 = A) ∧ (x3 = C) ∧ ¬(x5 = B) ∧ (x8 = 0) ∧ (x13 = C) ∧ ¬(x14 = C) ∧ ¬(x15 = B) ∧
      ¬(x15 = C) (3 - Y)
10    (x7 = 0) ∧ (x8 = 0) ∧ (x14 = C) ∧ (x15 = A) (5 - N)
11    (x3 = A) ∧ (x14 = C) ∧ ¬(x15 = B) (4 - Y)
12    ¬(x2 = B) ∧ (x8 = 2) ∧ ¬(x13 = C) ∧ ¬(x15 = A) ∧ ¬(x15 = B) (10 - N)
13    (x3 = A) ∧ (x7 = 0) ∧ ¬(x15 = A) ∧ ¬(x15 = B) (7 - Y)
14    (x5 = B) ∧ (x7 = 0) ∧ (x15 = D) (4 - Y)
15    (x7 = 0) ∧ ¬(x13 = C) ∧ ¬(x14 = A) ∧ (x15 = D) (4 - Y)
16    ¬(x2 = A) ∧ ¬(x3 = A) ∧ (x5 = A) ∧ ¬(x14 = A) ∧ ¬(x15 = B) (7 - N)
17    ¬(x2 = C) ∧ (x7 = 0) ∧ (x8 = 2) ∧ ¬(x14 = C) ∧ ¬(x15 = B) (5 - Y)
18    (x2 = B) ∧ (x7 = 0) ∧ (x15 = D) (3 - N)
19    (x2 = C) ∧ ¬(x3 = B) ∧ (x7 = 0) ∧ ¬(x13 = B) ∧ ¬(x15 = B) ∧ ¬(x15 = D) (5 - N)
20    ¬(x2 = A) ∧ ¬(x3 = C) ∧ (x5 = B) ∧ (x7 = 0) ∧ ¬(x13 = C) ∧ ¬(x14 = C) ∧
      ¬(x15 = B) ∧ ¬(x15 = D) (4 - Y)
21    (x12 = A) (3 - N)
22    ¬(x2 = A) ∧ (x3 = A) (3 - Y)
23    (x2 = A) (2 - N)
24    (x5 = A) ∧ (x7 = 0) ∧ (x8 = 0) ∧ (x13 = B) ∧ ¬(x15 = C) (2 - Y)
25    (x13 = B) (2 - N)
26    (x2 = B) ∧ ¬(x3 = B) ∧ (x14 = B) (2 - Y)
27    (x2 = B) (2 - N)
28    All the rest (2 - N)

The rules should be interpreted as in Table 4

The rules were found using the WEKA data mining software
package (http://www.cs.waikato.ac.nz/ml/weka/). WEKA implements
several traditional techniques such as ID3 and C4.5 [14]. We have
used the PART heuristic from this package, which uses a
divide-and-conquer approach to build these rules. It builds a partial
C4.5 decision tree in each iteration, and turns the "best" leaf into a
rule. The rules should be used in a cascaded manner (i.e., successive
if-else statements). Thus, it must be initially checked if the given
example satisfies the first rule. If it does not, we proceed to check
if it satisfies the second one, and so on, until the last rule. The
initial rules are able to better discriminate a larger number of
examples, mainly because, at this stage, most of the examples are
still unclassified. However, as we approach the end, almost all
examples have already been classified by the previous rules and thus
only outliers remain. The consequence is a reduction in the number of
samples being classified by the last rules.
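The cascaded reading of such rule lists can be sketched as a chain of if-else tests. In the Python sketch below, the two rules are transcribed from the first two rows of Table 5, while the dictionary layout, the function name, the "unclassified" fallback, and the three sample records are hypothetical.

```python
def classify(example, rules, default):
    """Return the label of the first rule whose clauses all hold."""
    for clauses, label in rules:
        if all(clause(example) for clause in clauses):
            return label
    return default

# Rules 1 and 2 of Table 5 (malignant? Y/N); later rules are omitted.
rules = [
    ([lambda e: e["x3"] != "B", lambda e: e["x7"] == 1], "N"),
    ([lambda e: e["x3"] != "A",
      lambda e: e["x12"] == "A",
      lambda e: e["x13"] == "B"], "Y"),
]

# Hypothetical records, not taken from the data set.
first = classify({"x3": "A", "x7": 1, "x12": "B", "x13": "C"},
                 rules, "unclassified")
second = classify({"x3": "C", "x7": 0, "x12": "A", "x13": "B"},
                  rules, "unclassified")
third = classify({"x3": "B", "x7": 1, "x12": "B", "x13": "A"},
                 rules, "unclassified")
```

Because a rule is only reached when all earlier rules have failed, a short rule late in the list implicitly carries the negation of everything before it.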

Using the features provided by the memetic algorithm for the 
intraductal carcinoma case, the total number of rules was 25 (see 
Table 4 and Note 2). For the malignant vs. benign case (Table 5 
and Note 3) the PART heuristic returned 28 rules. 

One may be inclined to believe that the number of rules created
may be a symptom of over-fitting. We must emphasize that
we are not over-fitting (at least not in the negative sense of loss of
generalization). What we aim to do is to fit exactly in order to find
relevant features, on which future classification efforts should be
concentrated. The rules presented in Tables 4 and 5 are a
contribution to this classification effort, but with no
generalization pretensions. We also stress that the rules must be
applied in cascade and no rule can be interpreted stand-alone. For
this reason, even though the last rules have fewer features, they are
still relevant because all previous rules are required to be false.
Following this idea, the last rules might actually rely on
information present in as many features as the initial rules.

Concerning the trade-off between the number of features and
the number of classification rules, two aspects must be taken into
account. The first is that the use of more features than necessary
(i.e., a superset of the optimal feature set) does not necessarily
improve the reliability of the rules. Indeed, given a certain number
of examples to work with, finding the optimal set of features that
can explain the data is expected to improve the a priori
generalization capability of the generated rules. The second aspect,
which may matter more in other application areas than in this one but
is worth mentioning in passing, is related to the cost of
data collection. Working with more features generally means that
more time and financial resources, particularly in the clinical
domain, are going to be spent on extracting the same information
from the data set.


10 Discussion 


In this section we discuss the results we have obtained in the
malignant vs. benign and the intraductal carcinoma vs. other
classification tasks from the database used. We first note that our
conclusions are based only on this data, and we aim at pointing
out some conclusions from this study only. We also expect that a
larger sample database will lead to the development of other types
of classification rules or give more support to the ones obtained
here. However, the results need to be put into the perspective of
other results in the medical literature and this is our aim.

10.1 Malignancy vs. Benign Classification

Considering this classification, one of the most relevant works is
from Kovalerchuk et al. [9]. Therein, the authors obtain two rules
from radiology experts that are commonly accepted as indicating
the possibility of malignant lesions. After translating them into our
format to identify the features they become:

• IF (x1 = C and x2 = C and x3 = C) THEN the lesion is highly
  suspicious for malignancy.

• IF (x1 = C and x2 = C and x3 = C and x6 = C and x13 = C)
  THEN the lesion is highly suspicious for malignancy.

When the rules were applied to the data set, the result was
unsatisfactory. The first rule classified four malignant cases correctly
and one benign was classified as malignant. The other 67 malignant
cases were missed. The second rule had an even worse performance,
classifying correctly only two samples. It misclassified two benign
cases as malignant and missed the other 69 malignant cases. The
conclusion we might draw is that these types of rules, which we may
label as worst-case rules, could be intuitively appealing but are not
suitable to help in the diagnosis of this data set.

Another previous work also dealing with malignant/benign
diagnosis is Yunus et al. [15]. In their work they show a strong
correlation between malignancy and five features: x1, x2, x5, x6, and
x13. From these, we have three of them (x2, x5, and x13) in the
optimal feature set found by our algorithm. Also, the dataset used
in [15] contains 19 cases with Le Gal type 5 and all of them are
malignant. In the database used in this study, this correlation is not
so strong but nevertheless very important. From 27 Le Gal type 5
cases, 21 are malignant, corresponding to a ratio of almost 80 %. In
our results, the only relevant Le Gal types for optimal
discrimination power were 1 and 2 (of course, complemented by the
other features in the optimal set). Le Gal type 1 appears only in
rule #1 and the rule classifies ten samples as benign. Le Gal type 2
appears in rules #5, #6, #12, and #17. These rules classify 23 samples
as benign and five samples as malignant, in total. That gives a 21 %
malignancy ratio, which is close to the expected malignancy
proportion for Le Gal type 2.

10.2 Intraductal Carcinoma

Now we discuss the intraductal carcinoma vs. other cases
classification task. For this case our algorithm found eight different
feature sets with 10 features. After finding the set of rules for each
of them we decided to report the one with the fewest rules, 25 in
total. The number of rules varied from 25 to 34. A potential
indication of the importance of a feature set, when there is more than
one optimal feature set of a given cardinality, is to count the number
of times that a feature has appeared in one of the optimal solutions;
secondly, the set of rules for all solutions should be generated and
their sizes checked.
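Counting how often each feature occurs across several optimal feature sets can be sketched as follows. The three feature sets and the kernel used here are hypothetical toy data chosen only to exercise the functions; they are not the eight optimal solutions found in this study.

```python
from collections import Counter

def feature_frequencies(optimal_sets):
    """Count in how many optimal feature sets each feature appears."""
    return Counter(f for feature_set in optimal_sets for f in feature_set)

def support(feature_set, freq, kernel=frozenset()):
    """Sum the frequencies of the non-kernel features of one set."""
    return sum(freq[f] for f in feature_set if f not in kernel)

# Hypothetical optimal sets sharing a one-feature kernel.
optimal_sets = [
    {"x7", "x2", "x6"},
    {"x7", "x6", "x13"},
    {"x7", "x2", "x13"},
]
freq = feature_frequencies(optimal_sets)
score = support(optimal_sets[0], freq, kernel={"x7"})
```

A feature set whose score is close to the maximum achievable is composed of features that recur in most of the optimal solutions.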
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10.3 Clustering 
Using the Optimal 
Feature Sets 


The number of times each feature not in the kernel appears in
the optimal solutions is: x1(2), x2(3), x3(1), x4(2), x5(4), x6(6),
x10(4), x12(6), x13(7), x14(6), and x16(7) (of course, those in the
kernel appear in all eight optimal solutions). This indicates that,
besides having the lowest number of rules, the feature set we selected
is also good in terms of the number of times its features appear in an
optimal solution (33 out of a
maximum of 36).
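
The bookkeeping behind these counts can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that every optimal set holds the four kernel features plus six others (ten in total), and that the "maximum of 36" is the sum of the six largest appearance counts.

```python
# Appearance counts of the non-kernel features across the eight optimal
# feature sets, copied from the text.
counts = {"x1": 2, "x2": 3, "x3": 1, "x4": 2, "x5": 4, "x6": 6,
          "x10": 4, "x12": 6, "x13": 7, "x14": 6, "x16": 7}

N_SOLUTIONS = 8   # number of optimal feature sets found
SET_SIZE = 10     # features per optimal set
KERNEL_SIZE = 4   # x7, x8, x9, x15 appear in every optimal set

# Slots left for non-kernel features in each optimal set (assumption above).
free_slots = SET_SIZE - KERNEL_SIZE

# Every optimal set contributes `free_slots` appearances to the counts.
total_appearances = sum(counts.values())

# Best achievable score for a single set: take the most frequent features.
best_possible = sum(sorted(counts.values(), reverse=True)[:free_slots])

print(total_appearances, N_SOLUTIONS * free_slots)
print(best_possible)
```

Under these assumptions the listed counts are internally consistent: the appearances add up to eight sets of six non-kernel features, and the best possible score for one set is 36.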

A direct comparison with the results of Kovalerchuk et al. [4] is
not possible. Firstly, they use almost all of the features (16 in
total), whereas we proved that only ten features are required for
the harder task of finding a discriminative feature set which also
maximizes within-class similarity (β = 1, which is the maximum in
this case). Secondly, their results are also based on a feature (volume
of calcifications) which we initially removed from consideration
in this article since we believe there is no support for its
inclusion. When it is included without associating particular thresholds
that would quantize it, the feature also seems to generate
a large number of “poor rules” (using the authors’ own words) or
rules with a small support in the dataset.

In the kernel, we have three features corresponding to Le Gal 
types (1, 2, and 3) that must be in any optimal solution. This gives
some support to the usefulness of this classification system. The 
other feature in the kernel is the “comparison with the previous 
exam,” which also seems to be well-correlated with current practice 
and recommendations. It is interesting to remark that the Le Gal 
type 4 also appears in half of the total number of optimal solutions 
while Le Gal type 5 does not appear in any of them. 

To illustrate other aspects of our discussion, we employ a visualiza¬ 
tion approach and clustering algorithm based on our work in gene 
expression data analysis [16, 17]. We refer the reader to those 
articles to understand the details of the method. In essence, the 
aim is to arrange, on two dimensions, a large number of one- 
dimensional arrays (containing the information of interest) in 
such a way that consecutive, or closely placed arrays, are as similar 
as possible. This is an NP-hard optimization problem, but our 
method can obtain either optimal or very close to optimal solu¬ 
tions. We thus provide a permutation of all the samples in the data 
set (without permuting the feature order) which helps to under¬ 
stand the data. 
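
To make the ordering idea concrete, the following sketch arranges a handful of binary samples so that neighbouring samples are similar. It is not the method of [16, 17]: it uses a simple greedy nearest-neighbour heuristic with Hamming distance, and the toy samples and function names are assumptions introduced for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equally long samples differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_order(samples):
    """Order sample indices so that each sample is followed by the most
    similar sample not yet placed (ties broken by lowest index)."""
    remaining = set(range(1, len(samples)))
    order = [0]  # start, arbitrarily, from the first sample
    while remaining:
        last = samples[order[-1]]
        nxt = min(remaining, key=lambda i: (hamming(last, samples[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_cost(samples, order):
    """Sum of distances between consecutive samples in the given order."""
    return sum(hamming(samples[a], samples[b])
               for a, b in zip(order, order[1:]))


# Hypothetical samples: rows are samples, columns are binary features.
samples = [
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (1, 1, 0, 1),
    (0, 1, 1, 1),
]

order = greedy_order(samples)
original_cost = path_cost(samples, list(range(len(samples))))
ordered_cost = path_cost(samples, order)
print(order, original_cost, ordered_cost)
```

A greedy pass like this gives no optimality guarantee; it only shows what "consecutive arrays are as similar as possible" means as an objective over permutations of the samples.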

In Fig. 2 we show two images. The samples were first divided 
according to their target feature and then clustered within each 
group. The goal is to visually identify correlations between the 
features in the optimal solution and the classification itself. If it is 
possible to find features that assume consistently different values for 
different classes, the classification problem becomes easier. On the 
other hand, the absence of any significant difference indicates that
the classification problem is more difficult and the rules need to be
more complex.

Fig. 2 Samples separated into two groups, according to their classification, and then clustered within each
group using the algorithm of [16]. Intraductal carcinoma vs. other cases (left) and Malignant vs. Benign (right).
The gray scale used helps to identify the feature attributes for each sample. Images were created using the
software for data clustering and visualization NBIMiner (Moscato et al. [17])

Beginning with the left image of Fig. 2, the most visually
relevant feature is x7 (Le Gal type 1), which is also in the kernel
found by the reduction rules. For the other features we could not
see any clear differences between the groups. This indicates that the
relations between the features and the tumor classification are very
fuzzy, reflecting the practical difficulty of establishing rules for
intraductal carcinoma using only individual features. In the right
image of the same figure we can see the relation between features
and classification better. Except for features x12, x13, x14, and x15,
the others show different average behaviors for the two classes. The
large number of features with distinct behaviors indicates that
classification of malignant vs. benign is easier than intraductal
carcinoma vs. other cases. This was also reflected in the different sizes of
the optimal feature sets found for both cases.


11 Conclusion 


This work tackled the (α, β)-k-Feature Set problem using a graph-theoretic
approach. One of the main contributions is the new
insights derived from a transformation from one problem 
(computer-aided rule generation for clinical diagnosis) into a con¬ 
strained graph domination problem. The approach also shows the 
power of three simple reduction techniques used to shrink the size 
of the resulting graph, without losing the optimal solution we seek. 
These reduction techniques were able to eliminate over 99 % of the 
graph’s nodes and edges allowing us to obtain provably optimal 
solutions. 

The reduction rules were applied to two classification problems:
intraductal carcinoma vs. others and malignant vs. benign
lesions. The rules established that x7, x8, x9, and x15 (Le Gal types I, II,
III and “comparison with previous exam,” respectively) must be in
the feature set, and that feature x11 (Le Gal type V) should be discarded,
for the intraductal carcinoma case; and that features x2, x7, x8, and
x14 (“total number of calcifications,” Le Gal types I, II and “density
of the parenchyma”) must be in the feature set for the malignant vs.
benign dichotomy classification. The original problem, which
would be a challenge for most optimization techniques, became a 
much more tractable one, in computational terms, after the reduc¬ 
tion rules were used. 

As well as the reduction techniques, we implemented a memetic
algorithm (MA) to find the optimum feature sets for both cases.
The MA was able to reach solutions proved to be optimal after an 
exhaustive search was conducted. The method found feature sets 
with sizes 10 and 9 for the intraductal carcinoma and malignancy 
problems, respectively. 

Finally, we determined a set of classification rules, obtained 
from the feature sets returned by the MA. These rules can, in 
principle, be applied to classify breast cancer tumors, although 
generalization concerns might arise. A promising direction for 
future research might arise if the algorithm introduced here is 
used to determine relevant features that can be used by other 
classification techniques, such as neural networks. However, additional
tests with instances composed of many more samples are
necessary.

Our method also points out the importance of large-scale 
combinatorial optimization models and exact and heuristic
techniques coupled with large databases of population-studies of
breast cancer to help evolve better computer-aided predictions. 
The generalization capability is likely to improve when the number 
of samples increases to larger values, on the order of thousands of 
samples. The relevance of the reduction rules in analyzing the data
suggests that this is an exciting area for future research, and indicates
that computer-automated methods for knowledge extraction from
large medical databases of population-wide studies are a viable
alternative to traditional methods. We also support making
these databases available in the public domain in raw format (instead of
with expert quantization of the attributes), since different thresholding
techniques such as those from [3] could lead to smaller feature sets
with higher robustness.


12 Notes 


1. Since its introduction, the approach of employing data mining 
techniques in high-dimensional datasets by using the
(α, β)-k-feature set problem as a combinatorial model to reduce the
dimensionality has found several applications. Its applications 
in the selection of biomarkers for prediction of Alzheimer’s 
disease [18-20], transcriptomic analyses of brain tissues 
[21-23], and identification of multiple sclerosis biomarkers in 
whole blood [24] were of great importance. It also helped to 
produce the first detection of childhood absence epilepsy using 
basal clinical EEGs [25], and has been applied in cancer research 
[26]. Integer programming formulations and current commer¬ 
cial software showed that the approach scales well in practice 
[11]. Our combinatorial approach offers a new alternative to 
statistical-only univariate procedures for the detection of bio¬ 
markers [26-28]. For clinicians, we think that our work could
stimulate interest in Le Gal’s classification of microcalcifications
[29-31] and their role as a possible early marker for pattern 
recognition panels [32-43]. The use of the PART heuristic 
leads to one particular approach to generate rules for classifica¬ 
tion and we have included this particular heuristic in our study as 
a novelty to the clinical community. We note, however, that after
employing the (α, β)-k-feature set approach, the information
provided by the reduced set of features can be used as an input 
for ensemble-based classifiers. This active area of research in 
machine learning has only recently been applied to problems in 
breast cancer [44-50]. We also expect that techniques based on 
mining disjunctive closed item sets could be used after feature 
selection [51, 52]. We expect that the synergies coming from the 
combination of these techniques will soon translate into improved
early tests for the diagnosis of this disease.

2. In this Note we present the first eight rules from Table 4,
obtained from the feature set generated with the memetic algo¬ 
rithm for the intraductal carcinoma case in natural language. As 
noted earlier in the chapter, these rules operate in a cascading 
fashion, that is, each rule must be applied in order, beginning 
with rule #1, until a rule applies. 

• Rule 1: If the following conditions are true THEN the pre¬ 
diction is NO for intraductal carcinoma (24 samples 
classified). 

Conditions: 

The irregularity in the size of calcifications is not marked 

and the variation in the density of the calcifications is not 
marked 

and Le Gal type is not #1 

and Le Gal type is #2 

and Le Gal type is not #3 

and there is no ductal orientation 

and the comparison with previous exam is defined (not 
“not defined”) 

and the associated findings do not show architectural 
distortion. 

• Rule 2: If the following conditions are true, and the previous 
rule does not apply, THEN the prediction is NO for intra¬ 
ductal carcinoma (9 samples classified). 

Conditions: 

The total number of calcifications is less than (or equal to) 30 

and the variation in the density of the calcifications is not 
marked 

and the comparison with previous exam is defined 

and the associated findings do show architectural 
distortion. 

• Rule 3: If the following conditions are true, and the previous 
rules do not apply, THEN the prediction is NO for intraduc¬ 
tal carcinoma (8 samples classified). 

Conditions: 

The variation in the density of the calcifications is not marked 
and Le Gal type is #1 
and Le Gal type is not #3 

and the comparison with previous exam is defined. 

• Rule 4: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for intraductal
carcinoma (14 samples classified).

Conditions:

The total number of calcifications is less than (or equal to) 30

and the irregularity in the size of the calcifications is not
moderate

and the variation in the density of the calcifications is not
marked

and Le Gal type is not #1

and Le Gal type is not #2

and there is no ductal orientation

and the comparison with previous exam is defined

and the comparison with previous exam is not newly developed

and there are no associated findings.

• Rule 5: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for intraductal
carcinoma (nine samples classified).

Conditions: 

The irregularity in the size of the calcifications is not mild 

and the variation in the density of the calcifications is not 
marked 

and the density of the calcifications is not moderate 

and the comparison with previous exam does not show a 
change in the number or character of calcifications 

and the comparison with previous exam is defined. 

• Rule 6: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is YES for intraductal
carcinoma (six samples classified).

Conditions: 

The variation in the density of the calcifications is marked 
and the comparison with previous exam is newly developed. 

• Rule 7: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for intraductal
carcinoma (four samples classified).

Conditions: 

The variation in the density of the calcifications is not 
moderate 

and Le Gal type is #1 

and the comparison with previous exam is defined. 

• Rule 8: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is YES for intraductal
carcinoma (three samples classified).

Conditions:

The irregularity in the size of the calcifications is not 
moderate 

and Le Gal type is not #1 

and the comparison with previous exam is defined 
and the associated findings show architectural distortion. 

3. In this Note we present the first eight rules from Table 5, for the 
malignant tumor case, generated from the feature set obtained 
by a Memetic algorithm, in natural language. Once again, these 
rules are to be used in a cascade fashion. That is, they are to be 
applied successively, beginning with rule #1, until a rule is 
satisfied. 

• Rule 1: If the following conditions are true THEN the pre¬ 
diction is NO for malignant (ten samples classified). 

Conditions: 

The irregularity in the shape of the calcifications is not 
moderate 

and Le Gal type is #1. 

• Rule 2: If the following conditions are true, and the previous 
rule does not apply, THEN the prediction is YES for malig¬ 
nant (15 samples classified). 

Conditions: 

The irregularity in the shape of the calcifications is not mild 

and there is ductal orientation 

and the density of the calcifications is moderate. 

• Rule 3: If the following conditions are true, and the previous 
rules do not apply, THEN the prediction is YES for malignant 
(six samples classified). 

Conditions: 

The total number of calcifications is greater than 30 

and the irregularity in the shape of the calcifications is not 
mild 

and the irregularity in the size of the calcifications is marked 

and there is no ductal orientation 

and the density of the calcifications is not low. 

• Rule 4: If the following conditions are true, and the previous 
rules do not apply, THEN the prediction is YES for malignant 
(four samples classified). 

Conditions: 

The irregularity in the shape of the calcifications is moderate 
and the comparison with previous exam is newly developed. 


• Rule 5: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for malignant
(eight samples classified).

Conditions: 

The irregularity in the size of the calcifications is not marked 

and Le Gal type is #2 

and there is no ductal orientation 

and the density of the calcifications is high 

and the comparison with previous exam is defined. 

• Rule 6: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for malignant
(five samples classified).

Conditions:

The total number of calcifications is between 10 and 30

and the irregularity in the shape of the calcifications is not
moderate

and the irregularity in the size of the calcifications is not
marked

and Le Gal type is not #1

and Le Gal type is #2

and there is no ductal orientation

and the comparison with previous exam is defined

and the comparison with previous exam is not newly developed.


• Rule 7: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is NO for malignant
(13 samples classified).

Conditions: 

The total number of calcifications is less than 10 

and the irregularity in the shape of the calcifications is not 
mild 

and the comparison with previous exam is defined. 

• Rule 8: If the following conditions are true, and the previous
rules do not apply, THEN the prediction is YES for malignant
(six samples classified).

Conditions: 

The total number of calcifications is less than (or equal to) 30 

and the irregularity in the size of the calcifications is not mild 

and Le Gal type is not #1 

and there is no ductal orientation 

and the density of the calcifications is low

and the comparison with previous exam shows a change in
the number or character of calcifications.
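
The cascading use of these rules amounts to a decision list: rules are tried in order and the first one whose conditions hold gives the prediction. The sketch below encodes only Rules 1 and 2 of this Note to show the mechanism. The feature names, the value encodings, and the single `le_gal_type` field are hypothetical choices made for this illustration, not the coding used in the dataset.

```python
# Each entry is (condition, prediction). A sample is a dict of feature values.
# Feature names and encodings are hypothetical labels for this sketch.
RULES = [
    # Rule 1: shape irregularity is not moderate and Le Gal type is #1 -> NO
    (lambda s: s["shape_irregularity"] != "moderate"
               and s["le_gal_type"] == 1,
     "NO"),
    # Rule 2: shape irregularity is not mild, there is ductal orientation,
    # and the density of the calcifications is moderate -> YES
    (lambda s: s["shape_irregularity"] != "mild"
               and s["ductal_orientation"]
               and s["density"] == "moderate",
     "YES"),
]


def classify(sample, rules=RULES):
    """Apply the rules in order; return (rule number, prediction) for the
    first rule that fires, or (None, None) if no rule applies."""
    for number, (condition, prediction) in enumerate(rules, start=1):
        if condition(sample):
            return number, prediction
    return None, None


sample_a = {"shape_irregularity": "mild", "le_gal_type": 1,
            "ductal_orientation": False, "density": "low"}
sample_b = {"shape_irregularity": "moderate", "le_gal_type": 1,
            "ductal_orientation": True, "density": "moderate"}
sample_c = {"shape_irregularity": "mild", "le_gal_type": 2,
            "ductal_orientation": False, "density": "high"}

result_a = classify(sample_a)
result_b = classify(sample_b)
result_c = classify(sample_c)
print(result_a, result_b, result_c)
```

Because evaluation stops at the first matching rule, a later rule is only meaningful for samples that all earlier rules rejected, which is why the rules cannot be applied in isolation or in a different order.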
Inference Method for Developing Mathematical Models
of Cell Signaling Pathways Using Proteomic Datasets

Tianhai Tian and Jiangning Song 

Abstract 

The progress in proteomics technologies has led to a rapid accumulation of large-scale proteomic datasets in 
recent years, which provides an unprecedented opportunity and valuable resources to understand how 
living organisms perform necessary functions at systems levels. This work presents a computational method 
for designing mathematical models based on proteomic datasets. Using the mitogen-activated protein 
(MAP) kinase pathway as the test system, we first develop a mathematical model including the cytosolic and 
nuclear subsystems. A key step of modeling is to apply a genetic algorithm to infer unknown model 
parameters. Then the robustness property of mathematical models is used as a criterion to select appropriate 
rate constants from the estimated candidates. Moreover, quantitative information such as the absolute 
protein concentrations is used to further refine the mathematical model. The successful application of this 
inference method to the MAP kinase pathway suggests that it is a useful and powerful approach for 
developing accurate mathematical models to gain important insights into the regulatory mechanisms of 
cell signaling pathways. 

Key words Cell signaling pathway, Reverse engineering, Proteomics, Robustness 


1 Introduction 


Proteomics is considered the next crucial step in studying biological
systems in the post-genomic era, as it allows large-scale determination
of genetic and cellular functions at the proteome level [1, 2].
The proteome is the complete repertoire of proteins, including
posttranslational modifications (PTMs) that occur in a particular
set of proteins. The purpose of proteomics research is to determine
the relative or absolute amount of proteins present in a biological
sample. Advanced proteomic technologies, including mass spectrometry
(MS), two-dimensional gel electrophoresis and protein
arrays, provide powerful methods for analyzing protein samples.
Proteomics technologies have emerged as potent tools for rapid
identification of proteins in complex biological samples and
characterization of PTMs and protein-protein interactions [3, 4].

An important application of MS-based proteomics is to characterize
cell signaling cascades, which involve the binding of extracellular
signaling molecules to cell-surface receptors, triggering events
inside the cell [5]. Phosphorylation, a reversible PTM, plays a
key role in regulating protein functions and localizations in this
process. Phosphoproteomics thus serves as a branch of proteomics
with the purpose of identifying and characterizing proteins that
contain a phosphate group as a PTM [5]. As a consequence,
phosphoproteome studies are able to provide a global and integrated
description of cellular signaling networks [6, 7]. However, the
complex nature of the cell signaling pathways remains to be fully 
characterized as to how they are exactly regulated in vivo and what 
parameters are responsible for determining their dynamics [8]. 
To address these questions, mathematical modeling is a powerful 
approach for deducing regulatory principles and interpreting 
signal transduction mechanisms that underlie various cellular 
functions [9]. 

The lack of kinetic rates for mathematical modeling is a major 
challenge for developing systems biology approaches. These 
should, in principle, be measured by experiments or estimated 
from experimental data. However, due to the limited amount of 
experimental data, a commonly adopted approach in systems biol¬ 
ogy studies is to collect published experimental data obtained from 
different cell types under various conditions. Therefore, the prog¬ 
ress in proteomics technologies and the rapid accumulation of 
proteomic data have offered an unprecedented opportunity to 
better understand how living organisms perform necessary func¬ 
tions at systems levels. From a systems biology perspective, the 
dynamic temporal data generated by phosphoproteomics experi¬ 
ments represent valuable resources for inferring unknown model 
parameters and modeling cell signaling networks [10]. However, to 
date, only limited work has been done to utilize the temporal 
dynamic proteome datasets for mathematical modeling of 
biological systems. This chapter presents a computational frame¬ 
work for developing accurate mathematical models using proteo¬ 
mic datasets. 


2 Review of Modeling for the MAP Kinase Pathway 

The mitogen-activated protein (MAP) kinase pathway is one of the 
most extensively studied signaling pathways. It communicates signals 
from the growth factor receptors on the cell surface to effector 
molecules located in the cytoplasm and nucleus. The MAP kinase 
cascade can be activated by the upstream input signal Ras protein, 
and comprises a set of three protein kinases: Raf, MEK, and ERK, 
together with a highly conserved molecular architecture that acts 
sequentially [11]. The activated MAP kinase is able to phosphorylate
multiple different substrates, including transcription factors, protein
kinases, phospholipases, and cytoskeletal proteins, and regulate a 
wide range of physiological responses, including cell proliferation, 
differentiation, apoptosis, and tissue development. The signaling 
downstream of Ras protein has an incredible complexity, which 
includes positive and negative feedback loops, protein re¬ 
localization, signaling complex formation, and cross talk between 
parallel signaling pathways. 

The EGF-regulated MAP kinase pathway is considered as the 
best-characterized signal transduction pathway. In the last two 
decades there has been a significant amount of experimental data 
published regarding signaling entities, regulatory interactions, 
kinase activities, protein absolute concentrations, and perturbation 
studies. Moreover, the advances in systems biology have led to the 
development of a large number of sophisticated mathematical mod¬ 
els with various assumptions about the regulatory mechanisms at 
different levels as well as model parameters inferred from experi¬ 
mental data under various experimental conditions and from dif¬ 
ferent cell types. Although the principal hierarchy of the signaling 
pathway and its activation sequence are well established, recent 
experimental studies have provided additional information on critical
protein-protein interactions, regulatory loops, and spatiotemporal
organization [12].

In the last decade, the MAP kinase pathway has often been used 
as a testable paradigm for interrogating systems biology 
approaches. In 1996, Huang and Ferrell developed the first mathematical
model by focusing on the Ras-dependent activation of the
MAP kinase module. The model predicted highly ultrasensitive
responses of the MAP kinase cascade, which were then confirmed by
experimentation [13]. The success of this work has stimulated a
great deal of interest in developing kinetic models to provide test¬ 
able predictions and novel insights into signaling events. For exam¬ 
ple, Bhalla et al. combined experiments and modeling to support 
the hypothesis that MAP kinase was involved in a bistable feedback 
loop [14]; Schoeberl et al. developed a mathematical model for the 
EGF-regulated MAP kinase pathway [15]; we demonstrated a crit¬ 
ical function of Ras nanoclusters in generating high-fidelity signal 
transduction [16]; and a recent study investigated functional cross
talk between the MAP kinase pathway and other signaling pathways
[17]. In addition, we have developed a mathematical model
that contains a nuclear subsystem of ERK kinase activation [18] and 
studied the robustness property of various kinase modules [19]. 
Nevertheless, the molecular mechanisms that underlie precise but 
robust control of MAP kinase signal intensity with a range of 
activation kinetics and diverse biological outcomes remain poorly 
understood. Using the MAP kinase pathway as the test system, this 
chapter discusses how to design a computational framework for 
developing accurate mathematical models of cell signaling pathways
using proteomic datasets.


3 Methods

3.1 Experimental Data

3.2 Development of Mathematical Model


Olsen et al. recently applied an integrated phosphoproteomic
technology to identify and quantitate the global in vivo
phosphoproteome and its temporal dynamics upon growth-factor
stimulation in human HeLa cells. In that study, HeLa cells were
stimulated with 150 ng/ml of EGF for different time intervals.
The resulting proteome dataset includes the quantitative temporal activity
ratios of 2244 proteins with a total of 6600 phosphorylation sites,
and is available as an Excel file in the supplementary information of
Ref. [6]. However, this dataset includes a proportion of
missing values for quite a large number of proteins (see Note 1).

We used the proteomic data of the ARaf1 protein, the dual-specificity
mitogen-activated protein kinase kinase 2 (MEK) and the
mitogen-activated protein kinase 1 (ERK) in the supplementary
table. In this dataset, the kinase activities were measured at 0, 1,
5, 10, and 20 min. The activities of each kinase were further
normalized by its activity at 5 min. While the activities of ARaf1
were obtained in the cytosol only, the activities of MEK and ERK
were available in both the cytosol and nucleus. Since the kinase
activities in the proteomic dataset were mostly available at five time
points, we used linear interpolation to generate kinase activities
at another 16 time points during the time interval [0, 20] (min).
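
The interpolation step can be sketched as follows. The activity values are hypothetical (chosen so that the value at 5 min equals 1, mimicking the normalization), and the choice of a one-minute grid is an assumption: it is simply one grid on which the five measured points plus 16 interpolated points cover [0, 20] min.

```python
def interpolate(times, values, query_times):
    """Piecewise-linear interpolation of (times, values) at query_times.
    `times` must be increasing and every query must lie inside its range."""
    result = []
    for t in query_times:
        segments = zip(zip(times, values), zip(times[1:], values[1:]))
        for (t0, v0), (t1, v1) in segments:
            if t0 <= t <= t1:
                result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                break
        else:
            raise ValueError(f"query time {t} is outside the measured range")
    return result


# Measurement times from the text (min) and hypothetical normalized activities.
measured_times = [0, 1, 5, 10, 20]
measured_activity = [0.0, 0.6, 1.0, 0.5, 0.25]

# Assumed one-minute grid: 21 points, of which 5 are measured and 16 are new.
grid = list(range(21))
dense_activity = interpolate(measured_times, measured_activity, grid)

n_new_points = len(grid) - len(measured_times)
print(n_new_points)
print(dense_activity[3], dense_activity[15])
```

The measured values are reproduced exactly at 0, 1, 5, 10, and 20 min; only the points in between are filled in.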

Additional experimental data were also available using Western
blotting analysis and other experimental techniques in human
HeLa cells [20]. In that study, HeLa cells were stimulated with 50 ng/ml of
EGF for different time periods. Because both studies used HeLa cells,
the datasets in [6, 20] can be combined in our study. The Ras activity
in Ref. [20] was used as the signal input of the MAP kinase module.
The absolute kinase concentrations and the fractions of the activated
kinases (at 5 min) in Ref. [20] were also used and led to the
absolute activated kinase concentrations at 5 min shown in Table 1.
The relative kinase activities in the proteomic study were then
rescaled using the absolute activated kinase concentrations at 5 min.
It is noteworthy that the Raf, MEK, and ERK kinase activities in
Ref. [20] were utilized only to compare with the simulated kinase
activities and served as evidence to validate the feasibility of the
proposed modeling framework.
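
As a worked example of the rescaling, the sketch below recomputes the activated MEK and ERK concentrations at 5 min from the Table 1 entries and then rescales a relative time course. The relative ERK profile and the helper name `rescale` are hypothetical; only the concentrations and fractions come from Table 1.

```python
# Absolute kinase concentrations and activated fractions at 5 min (Table 1).
total_concentration = {"MEK": 1.4, "ERK": 0.96}
active_fraction_5min = {"MEK": 0.05, "ERK": 0.50}

# Absolute activated concentration at 5 min = total x activated fraction.
active_concentration_5min = {
    kinase: total_concentration[kinase] * active_fraction_5min[kinase]
    for kinase in total_concentration
}


def rescale(relative_activity, kinase):
    """Convert activities normalized to 1 at 5 min into absolute units."""
    scale = active_concentration_5min[kinase]
    return [a * scale for a in relative_activity]


# Hypothetical normalized ERK profile at 0, 1, 5, 10, and 20 min.
relative_erk = [0.0, 0.6, 1.0, 0.5, 0.25]
absolute_erk = rescale(relative_erk, "ERK")

print({k: round(v, 3) for k, v in active_concentration_5min.items()})
print([round(v, 3) for v in absolute_erk])
```

The computed values for MEK and ERK agree with the last column of Table 1 (0.07 and 0.48).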

Our MAP kinase pathway model comprises a cytosolic subsystem 
and a nuclear subsystem [18]. In the cytosolic subsystem, the Ras- 
GTP is the signal input of the MAP kinase cascade and activates Raf 
molecules in a single step. This activation is followed by sequential 
activation of the dual-specificity MAP kinase kinase (i.e., MEK) by 
Raf* (i.e., activated Raf) in a single-step processive module (see 
Note 2). The activated MEKpp (i.e., phosphorylated MEK at two 
residue positions) in turn activates ERK in a two-step distributive
module [21]. The activated ERKpp (i.e., phosphorylated ERK at
two residue positions) is the signal output of the MAP kinase
module. Both activated and inactivated MEK and ERK kinases
diffuse between the cytosol and nucleus freely. In the nuclear
subsystem, the activated MEKpp further activates the ERK kinase
via the distributive two-step phosphorylation module. In addition,
phosphatases, such as Raf-P’ase, MEK-P’ase, and ERK-P’ase, can
respectively deactivate the activated Raf*, MEKpp, and ERKpp
kinases at different subcellular locations.

Table 1
Protein concentrations of the pathway models

               Initial condition   Initial condition   Max % of the activated        Activated kinases at
               of System 1         of System 2         kinase at 5 min in System 2   5 min in System 2
[Ras]          1                   0.4 [20]                                          0.4
[Raf]          1                   0.013 [20]                                        0.013
[Raf-P’ase]    1                   0.002 [15]
[MEK]          1                   1.4 [20]            5 % [20]                      0.07
[MEK-P’ase]    1                   0.14 [15]
[ERK]          1                   0.96 [20]           50 % [20]                     0.48
[ERK-P’ase]    1                   0.48 [15]

System 1 is the model based on the proteomic data only with normalized protein concentrations, while System 2 is the
model based on both proteomic and other experimental data with absolute protein concentrations. Except for the variables
in this table, the initial conditions of all other variables are set to zero. The concentrations of the three phosphatases are
calculated based on both the absolute kinase concentration in Ref. [20] and the ratio of phosphatase concentration to the
corresponding kinase concentration in Ref. [15]

The detailed process of kinase activation is described by a set of 
chemical reactions [18]. Briefly, the activated kinase (or phospha¬ 
tase) K binds to its substrate S (or activated kinase Sp) to form a 
protein complex K-S (or K-Sp), which leads to the activated sub¬ 
strate Sp (or deactivated kinase S). Examples of these reactions are 
provided below: 

1. Processive phosphorylation module of MEK kinase

$$\mathrm{Raf}^{*} + \mathrm{MEK} \underset{d_1}{\overset{a_1}{\rightleftharpoons}} \mathrm{Raf}^{*}\text{-}\mathrm{MEK} \xrightarrow{k_1} \mathrm{Raf}^{*} + \mathrm{MEKpp} \tag{1}$$

2. Distributive phosphorylation module of ERK kinase

$$\mathrm{MEKpp} + \mathrm{ERK} \underset{d_2}{\overset{a_2}{\rightleftharpoons}} \mathrm{MEKpp}\text{-}\mathrm{ERK} \xrightarrow{k_2} \mathrm{MEKpp} + \mathrm{ERKp} \tag{2}$$

$$\mathrm{MEKpp} + \mathrm{ERKp} \underset{d_3}{\overset{a_3}{\rightleftharpoons}} \mathrm{MEKpp}\text{-}\mathrm{ERKp} \xrightarrow{k_3} \mathrm{MEKpp} + \mathrm{ERKpp} \tag{3}$$

3. Dephosphorylation reactions of activated ERK kinase

$$\mathrm{ERKp} + \text{ERK-P'ase} \underset{d_4}{\overset{a_4}{\rightleftharpoons}} \mathrm{ERKp}\text{-}\text{ERK-P'ase} \xrightarrow{k_4} \mathrm{ERK} + \text{ERK-P'ase} \tag{4}$$

$$\mathrm{ERKpp} + \text{ERK-P'ase} \underset{d_5}{\overset{a_5}{\rightleftharpoons}} \mathrm{ERKpp}\text{-}\text{ERK-P'ase} \xrightarrow{k_5} \mathrm{ERKp} + \text{ERK-P'ase} \tag{5}$$

where $a_i$, $d_i$ and $k_i$ represent protein binding, dissociation and
activation rate constants, respectively. The diffusion of MEK kinase,
for example, between the cytosolic and nuclear subsystems is
represented by

$$\mathrm{MEK} \underset{b_1}{\overset{f_1}{\rightleftharpoons}} \text{N-MEK}, \tag{6}$$

where MEK and N-MEK are MEK kinases located in the cytosolic
and nuclear subsystems, respectively, and $f_1$ and $b_1$ are diffusion rate
constants.

A mathematical model has been constructed according to the
chemical rate equations of these chemical reactions [18]. For example,
Reaction 1 leads to the differential equation for the dynamics of
the Raf*-MEK complex, which is given by

$$\frac{d[\mathrm{Raf}^{*}\text{-}\mathrm{MEK}]}{dt} = a_1[\mathrm{Raf}^{*}][\mathrm{MEK}] - (d_1 + k_1)[\mathrm{Raf}^{*}\text{-}\mathrm{MEK}] \tag{7}$$

This mathematical model comprises 33 differential equations which
represent the dynamics of 33 variables in the system. To test all the
possibilities of molecular mechanisms, we make no assumptions
about the model rate constants and as a result there are 57
unknown reaction rate constants. A promising research topic is to
develop sophisticated models with fewer unknown parameters
(see Note 3).

3.3 Estimation of Model’s Kinetic Rates

A genetic algorithm is used to estimate all model parameters. The 
MATLAB toolbox developed by Chipperfield et al. [22] is 
employed to infer the 57 unknown rate constants. It uses MATLAB 
functions to build a set of versatile routines for implementation of a 
wide range of genetic algorithms. The genetic algorithm is run over 
500 generations for each rate estimate, and a population of 100 
individuals in each generation is used. The values of rate constants
are taken initially from the uniform distribution in the range of
$[0, W_{\max}]$, and the value of $W_{\max}$ is fixed to 1000 for each rate
constant.
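
As a rough illustration of this estimation step, the following Python sketch fits the three rate constants of Reaction 1 to synthetic data with a very small genetic algorithm. It is not the MATLAB toolbox of Chipperfield et al. [22]. The function names, the explicit Euler integration, the elitist selection with uniform crossover, and the toy settings (30 individuals, 40 generations, an upper bound of 5 instead of 1000, and "true" rates of 2.0, 0.5, and 1.0) are all assumptions made for the example.

```python
import random

def simulate(rates, t_end=5.0, dt=0.01, sample_every=100):
    """Integrate Reaction 1 with explicit Euler and return sampled MEKpp levels."""
    a, d, k = rates                              # binding, dissociation, activation
    raf, mek, cplx, mekpp = 1.0, 1.0, 0.0, 0.0   # toy initial condition (one unit each)
    samples = []
    for step in range(1, int(round(t_end / dt)) + 1):
        bind = a * raf * mek                     # Raf* + MEK -> Raf*-MEK
        unbind = d * cplx                        # Raf*-MEK -> Raf* + MEK
        activate = k * cplx                      # Raf*-MEK -> Raf* + MEKpp
        raf += dt * (unbind + activate - bind)
        mek += dt * (unbind - bind)
        cplx += dt * (bind - unbind - activate)  # right-hand side of Eq. (7)
        mekpp += dt * activate
        if step % sample_every == 0:
            samples.append(mekpp)
    return samples

def error(rates, data):
    """Sum of absolute differences between simulated and 'measured' MEKpp."""
    return sum(abs(s - x) for s, x in zip(simulate(rates), data))

def genetic_algorithm(data, pop_size=30, generations=40, w_max=5.0, seed=0):
    rng = random.Random(seed)
    # Initial rates are drawn uniformly from [0, w_max].
    pop = [[rng.uniform(0.0, w_max) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda r: error(r, data))
        history.append(error(pop[0], data))
        elite = pop[: pop_size // 2]             # better half is kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]       # uniform crossover
            child = [min(w_max, max(0.0, g + rng.gauss(0.0, 0.1)))   # clipped mutation
                     for g in child]
            children.append(child)
        pop = elite + children
    best = min(pop, key=lambda r: error(r, data))
    history.append(error(best, data))
    return best, history

data = simulate([2.0, 0.5, 1.0])                 # synthetic "measurements"
best_rates, history = genetic_algorithm(data)
print(best_rates, history[0], history[-1])
```

Because the better half of each generation is carried over unchanged, the best error recorded in `history` never increases. Rerunning with different `seed` values mimics the repeated runs that yield several candidate estimates.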
The estimation error is measured by the weighted distance
between the simulated kinase activities and experimental data.
The weight of each kinase is determined by its corresponding
maximal activity. The total error is calculated by

$$E = \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\left| x_i(t_j) - x_{ij}^{*} \right|}{\max_j \{ x_i(t_j) \}} \quad (8)$$

where $x_{ij}^{*}$ and $x_i(t_j)$ are the simulated and experimentally measured
activities of kinase $x_i$ at time point $t_j$, respectively.
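
The error measure of Eq. (8) can be sketched in a few lines of Python. The function name and the nested-list data layout (one row per kinase, one column per time point) are assumptions for illustration, as are the numbers in the example.

```python
def weighted_error(simulated, measured):
    """Eq. (8): absolute differences, scaled per kinase by its maximal measured activity.

    simulated[i][j] and measured[i][j] hold the activity of kinase i at time point j.
    """
    total = 0.0
    for sim_i, exp_i in zip(simulated, measured):
        weight = max(exp_i)                      # maximal measured activity of kinase i
        total += sum(abs(x - s) for x, s in zip(exp_i, sim_i)) / weight
    return total

# Two hypothetical kinases measured at three time points.
measured = [[0.0, 1.0, 0.5], [0.0, 2.0, 4.0]]
simulated = [[0.0, 0.8, 0.5], [0.5, 2.0, 3.0]]
print(weighted_error(simulated, measured))
```

In this example the first kinase contributes 0.2/1 and the second 1.5/4, so the scaling prevents the kinase with the larger absolute activity from dominating the total.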

First, we use the genetic algorithm to infer the model kinetic 
rates based on the proteomic dataset [6]. The corresponding model 
is termed System 1. The total concentration of each kinase or 
phosphatase is assumed to be one unit. The initial condition of 
the differential equation model is given in Table 1. To be consistent
with the normalized kinase activities in the proteomic dataset [6],
the simulated activity of each kinase is also normalized by its activity
at 5 min, and we choose $\max_j\{x_i(t_j)\} = 1$ in Eq. (8) for calculating
the error between the simulation and proteomic data. The
parameter set that produces the smallest simulation error with respect
to the proteomics data is selected as the estimated model rate
constants. Due to the local optima issue of the genetic algorithm,
we implement the genetic algorithm with different random seeds 
that lead to different estimates of the model’s kinetic rates. Accord¬ 
ingly, we obtain 20 sets of estimated rate constants and select the 
top ten estimates with smaller simulation errors when compared to 
the proteomic data for further analysis. Next, we use the robustness 
property of the model as an additional criterion to select the opti¬ 
mal rate constants. 

Figure 1 provides the simulation results of the MAP kinase 
pathway using the model that has the smallest estimation error. 
The corresponding estimated model parameters are given in the 
Supplementary Table 1 in Ref. [18]. To compare with the proteo¬ 
mic data, simulations are also normalized by the simulated kinase 
activity at 5 min. The total activity of MEK in Fig. 1c (ERK in
Fig. 1d) is also normalized by the corresponding total kinase activity
at 5 min. The results show that the simulated kinase activities
match the Raf* activities in the cytosol (Fig. 1b) and ERKpp
activities in both the cytosol and nucleus (Fig. 1f) very well. However,
there is a large difference between the simulated MEK activities
and proteomic data in Fig. 1c, possibly because of differences
between the MEK kinase proteomic data in the cytosol and nucleus
as well as noise in proteomic data (see Note 4). In addition, the
simulated MEK activities in the nucleus are also in good agreement 
with the proteomic data. 






Tianhai Tian and Jiangning Song 






Fig. 1 Simulations of the normalized kinase activities. (a) Normalized Ras activity as the signal input from [20].
(b) Raf activity; (c) Total MEK activity; and (d) Total ERK activity (blue line: simulation; green line: normalized
Western blotting data [20]; red line: proteomic data [6]). (e) MEK activity and (f) ERK activity at different
locations (blue line: simulation in the cytosol; red line: proteomic data in the cytosol; green line: simulation in
the nucleus; black line: proteomic data in the nucleus)


3.4 Robustness Property Analysis

Robustness can be defined as the ability of a system to function
correctly in the presence of both internal and external uncertainty.
As robustness is a ubiquitously observed property of biological 
systems [23, 24], it has been widely used as an important measure 
to select the optimal network structure or model rate constants 
from estimated candidates, including the MAP kinase pathway [25, 
26]. A formal and abstract definition of the robustness property, 
given by Kitano [27], is consistent with the general principle of the 
robustness property of complex systems, and has been widely used 
in the analysis of robustness properties of biological systems. 

Here, we use the concept proposed by Kitano [27] to measure
the robustness property of the model. The robustness property of a
mathematical model with respect to a set of perturbations $P$ is
defined as the average of an evaluation function $D_a^{s}$ of the system
over all perturbations $p \in P$, weighted by the perturbation
probabilities $\mathrm{prob}(p)$, given by

$$R_{a,P}^{s} = \int_{p \in P} \mathrm{prob}(p)\, D_a^{s}(p)\, dp \quad (9)$$
Here, the following measure is used to evaluate the average
behavior

$$R_{a,P}^{M} = \sum_{i,j} \int_{p \in P} \mathrm{prob}(p)\, x_{ij}(p)\, dp \quad (10)$$

which is the mean of kinase activities that should be close to the
simulated kinase activity obtained from the unperturbed rate constants.
In addition, the impact of perturbations on nominal behavior
is defined by

$$R_{a,P}^{N} = \sum_{i,j} \int_{p \in P} \mathrm{prob}(p)\, \bigl( x_{ij}(p) - \bar{x}_{ij}(p) \bigr)^{2}\, dp \quad (11)$$

where $x_{ij}(p)$ and $x_{ij}$ are the simulated activities of kinase $x_i$ at time
point $t_j$ with perturbed and unperturbed rate constants, respectively,
and $\bar{x}_{ij}(p)$ is the mean of $x_{ij}(p)$ over all the perturbed kinetic
rates.


For each rate constant $k_i$, the perturbed value $\tilde{k}_i$ is set to

$$\tilde{k}_i = \max\{ k_i (1 + \mu (U - 0.5)),\, 0 \} \quad (12)$$

with a uniformly distributed random variable $U(0, 1)$ or

$$\tilde{k}_i = \max\{ k_i (1 + \mu N),\, 0 \} \quad (13)$$

with the standard Gaussian random variable $N(0, 1)$. Here $\mu$ represents
the perturbation strength.

To identify the best set of kinetic rates, we perform the robust¬ 
ness analysis of the mathematical model for the selected ten esti¬ 
mates of kinetic rates. For each set of model rate constants, we first 
use the estimated kinetic rates without any perturbation to produce 
a simulation that is used as the standard kinase activity. Then we 
perturb the value of each parameter using the generated random 
number. New simulations are obtained using the perturbed rate 
constants, and we then compare the new simulations with the 
standard simulation derived from the unperturbed model rate con¬ 
stants. The system with a particular set of rate constants is more 
stable if the difference between the new simulations and standard 
simulation is smaller. For each set of estimated rate constants, we 
generate 10,000 sets of perturbed rate constants using the uniformly
distributed random variable and $\mu = 0.5$ in Eq. (12).
According to Kitano’s definition of robustness [27], we use the
average behavior, which is the sum of all the means of each kinase
activity as calculated by Eq. (10), and the nominal behavior, which is
the sum of all the variances of each kinase activity as calculated by
Eq. (11), as the measures of the robustness property.
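
The perturbation procedure described above can be sketched as a small Monte Carlo loop in Python. The two-parameter `toy_model` is a hypothetical stand-in for the 33-equation model, and the function names, the sample size of 1000, and the rate values are assumptions for illustration. The integrals over perturbations are approximated by sample means and sample variances.

```python
import math
import random

def perturb(rates, mu, rng, gaussian=False):
    """Perturb every rate constant: uniform noise (Eq. 12) or Gaussian noise (Eq. 13)."""
    def noise():
        return rng.gauss(0.0, 1.0) if gaussian else rng.random() - 0.5
    return [max(k * (1.0 + mu * noise()), 0.0) for k in rates]

def toy_model(rates, times=(1.0, 2.0, 5.0)):
    """Hypothetical stand-in model: activity of one kinase at three time points."""
    k_on, k_off = rates
    total = k_on + k_off
    if total == 0.0:
        return [0.0 for _ in times]
    return [k_on / total * (1.0 - math.exp(-total * t)) for t in times]

def robustness(rates, mu=0.5, n_samples=1000, seed=0):
    """Return (average behavior, nominal behavior) estimated from random perturbations."""
    rng = random.Random(seed)
    runs = [toy_model(perturb(rates, mu, rng)) for _ in range(n_samples)]
    n_points = len(runs[0])
    means = [sum(run[j] for run in runs) / n_samples for j in range(n_points)]
    variances = [sum((run[j] - means[j]) ** 2 for run in runs) / n_samples
                 for j in range(n_points)]
    return sum(means), sum(variances)

average, nominal = robustness([1.0, 0.5])
print(average, sum(toy_model([1.0, 0.5])), nominal)
```

A candidate set of rate constants would be judged more robust when its average behavior stays close to the unperturbed simulation and its nominal behavior (the summed variance) is small.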

Figure 2a and b illustrate the average behavior and nominal
behavior of the mathematical model with ten different sets of
estimated rate constants. We further test the robustness property
of this model in cases where the ten sets of estimated rate constants
are perturbed by the Gaussian random variable with strength
$\mu = 0.5$ in Eq. (13). In this case, the simulated perturbations of
kinase activities are smaller than but still proportional to the
corresponding perturbations in Fig. 2a and b (results not shown).
In addition, we test the robustness property of the model using the
ten sets of the rejected rate constants that generate simulations with
larger errors. Simulation results suggest that there is no correlation
between the model estimation error and robustness property.

Fig. 2 Robustness analysis. (a, b) Robustness analysis of the proposed model with ten sets of estimated
kinetic rates derived from the normalized proteomic data. (a) The average behavior and (b) nominal behavior
of the model with perturbed kinetic rates. (c, d) Robustness analysis of the proposed model with ten sets of
estimated kinetic rates that were derived from more resources of experimental data. (c) The average behavior
and (d) nominal behavior of the model with perturbed kinetic rates. (Blue line: Raf; green line: MEK; red line:
ERK. The horizontal dashed lines in (a) and (c) are the simulated kinase activities based on the unperturbed
model kinetic rates)

To demonstrate the feasibility of our approach, we compare our
simulated kinase activities in Fig. 1 with the kinase activities
measured in vivo by Western blotting, which are also normalized by
their activity at 5 min [20]. Figure 1 shows that our computer
simulation matches the Raf activity (Fig. 1b) and ERK activity
(Fig. 1d) very well. However, the measured MEK activity in
Fig. 1c is different from the proteomic data, and interestingly, the
simulated MEK activity is located between the proteomic data and
Western blotting data. The simulated MEK activity becomes smaller,
rather than larger, than the proteomic data as time increases.
This suggests that in the cell signaling cascade, the downstream
signal activity may be used to calibrate the measurement
errors of the upstream signals present in the proteomic datasets.

3.5 Model Refinement by Incorporating More Experimental Data

Although the normalized simulation results match the proteomic
and experimental data very well in Fig. 1, the robustness analysis
shown in Fig. 2 suggests that the percentages of the activated kinases
are quite low. In addition, the fraction of the activated MEK kinase
is larger than that of the activated ERK, which contradicts
previous observations [15, 16, 20]. When using the absolute protein
concentrations as the initial condition to simulate the model,
we find a large difference between the predicted kinase activities
and experimentally measured activities [20]. These results suggest
that the normalized proteomic data might not be adequate for
accurate inference of cell signaling pathways. To achieve better
inference results, more experimental data, such as proteomic data,
should be incorporated into the model [28] (see Note 5).

Therefore, to further refine the mathematical model, we use 
the experimentally measured absolute total concentrations of each 
kinase, which is also the initial condition of System 2 in Table 1, 
together with the information on the maximal percentages of MEK 
and ERK kinases that are activated by EGF stimulation [20],
presented in Table 1. Then the normalized proteomic data (with kinase
activity of unit one at 5 min) are rescaled by the absolute kinase
activities in Table 1. The kinase activity is calculated by

[kinase activity] = [proteomic kinase activity] × [kinase activity at 5 min in System 2].

Note that the relative activities of each kinase remain unchanged. In
addition, the absolute concentrations of the three phosphatases, 
namely Raf-P’ase, MEK-P’ase, and ERK-P’ase, are also included in 
the model using experimentally measured data [15, 20], which is 
part of the initial conditions of System 2 in Table 1. Note that the 
Raf, MEK, and ERK kinase activities in Fig. 7 in Ref. [20] are only 
used to compare with the simulated kinase activities. As no further 
information is currently available regarding the distributions of 
activated MEK and ERK kinases at different subcellular locations, 
we use the proteomic data to generate normalized kinase activities 
in the cytosol and nucleus. In summary, the experimental data 
provide: (1) the absolute concentrations of the activated Raf, total 
MEK activity and total ERK activity in the first 20 min stimulated 
by Ras-GTP-binding; (2) the normalized activities of MEK and 
ERK kinases in the cytosol and nucleus in the first 20 min. 

We use these experimental data to infer the model rate con¬ 
stants once again. To balance the errors of different kinases, the 
weight to scale the error of each kinase in Eq. (8) is the experimen¬ 
tally measured maximal activity of that kinase. However, for the 
normalized activities of MEK and ERK in the cytosol and nucleus,
the weight in Eq. (8) is set to one. In this case, we also derive
20 sets of estimated model rate constants by repeated implementa¬ 
tions of the genetic algorithm and select the top ten sets with 
smaller estimation errors. For the top ten sets of model rate con¬ 
stants, we use the same method previously described to perform the 
robustness analysis, and select kinetic rates that lead to the best 
robustness property of the system as our final estimate [18]. 

The major advantage of incorporating more experimental data 
is that the mathematical model can realize experimental observa¬ 
tions in a much more accurate manner and accordingly computer 
simulations are able to provide testable predictions regarding the 
regulatory mechanisms. Figure 3 displays the simulated system 
dynamics with the absolute kinase activities. We can see that com¬ 
puter simulations match the experimental data very well for the Raf 
activities in Fig. 3b, and the total ERK kinase activities in Fig. 3d. 
Moreover, the normalized MEK activity in the cytosol is very close 
to that in the nucleus, which is consistent with the experimental 
observation [20]. Another advantage of the refined model is that it 
has a very good robustness property in response to the perturba¬ 
tions of rate constants. Compared with the results in Fig. 2a and b, 
the numerical results in Fig. 2c and d suggest that the developed 
model based on the absolute kinase concentrations has a better 
robustness property than that based on the normalized kinase 
concentrations. 


4 Framework for Developing Mathematical Models 

In summary, the flowchart of our proposed modeling framework is 
illustrated in Fig. 4. The model may also include a graphical sche¬ 
matic structure of the signaling pathway, a list of all chemical 
reactions involved and a mathematical model that is a system of 
differential equations. The proteomic data are the time-course 
quantitative data of kinase activities. The other datasets include 
data resources obtained by other experimental techniques such as 
the FRET imaging and Western blotting. Using the genetic algo¬ 
rithm, we can obtain a number of candidate estimates of model 
parameters. The robustness analysis will be applied to the estimated 
candidates to identify the parameter set that has the best robustness 
property as our final parameter estimate. Finally, we can apply the 
built model to make testable predictions regarding the signal out¬ 
put under various system conditions. 
Fig. 3 Simulated kinase activities based on the incorporation of proteomic data and Western blotting data. (a)
Normalized Ras activity as the signal input [20]. (b) Raf activity; (c) Total MEK activity; and (d) Total ERK
activity (blue line: simulation; green line: Western blotting data [20]; red line: rescaled proteomic data [6]).
(e) MEK activity and (f) ERK activity at different locations (blue line: simulation in the cytosol; red line:
proteomic data in the cytosol; green line: simulation in the nucleus; black line: proteomic data in the nucleus)


Fig. 4 Flowchart of the proposed modeling framework for developing mathematical models of cell signaling
pathways using proteomic datasets


5 Notes


1. A major challenge in using proteomic data is missing
values of kinase activities. Although a number of statistical methods
have been proposed to estimate missing values, implementing
these methods is extremely difficult if the activity of a protein
is completely unavailable in the proteomics dataset. An example
is the Ras protein, whose activities are not available in the
proteomic dataset at all. Thus, other sources of biological data
must be used to fill these data gaps.

2. The MAP kinase cascade comprises a set of three protein kinases,
namely MAP kinase kinase kinase (MAPKKK or MAP3K), MAP
kinase kinase (MAPKK or MAP2K), and MAP kinase (MAPK),
with a highly conserved molecular architecture that acts sequentially
[11]. In the MAP kinase pathway discussed in this chapter,
these three kinases are Raf, MEK, and ERK proteins.

3. This chapter focuses on the issue of establishing mathematical
models from proteomic datasets. However, only small amounts
of experimental data are used in this work to refine the
developed mathematical model. The simulation results
suggest that integration of more experimental data could further
improve the accuracy of the mathematical model substantially.
Future work should thus include the development of more
sophisticated models for cell signaling pathways through the
combination of large-scale proteomic datasets, more experimental
data, more signaling regulatory mechanisms, and estimated
model parameters.

4. Proteomics data suffer from considerable noise, including not 
only the technical noise arising from repeated experimental pro¬ 
cesses but also analysis noise [29]. Noise, such as the error of 
MEK kinase activity in this study, may result in significant varia¬ 
tions during the inference of mathematical models. However, 
compared with the developed stochastic methods for interrogat¬ 
ing the role of noise in microarray expression data [30, 31], the 
study of noise in proteomic data is still at the very early stage of 
development. More work is required to investigate the influence 
of noise on the development of mathematical models based on 
the noisy proteomic datasets. 

5. Another issue is the normalization of proteomic data, which causes
uncertainty about protein concentrations in mathematical
modeling. In this framework we first use the unified protein
concentrations, where the information of the absolute protein
concentrations is not known a priori. Although the normalized
simulations can match the normalized experimental data very
well, the simulated relative protein concentrations do not necessarily
reflect the real scenario of signaling pathways, since the
concentrations of proteins may also play an important role in
modulating signal transduction. Therefore, it is necessary to
enrich the data by integrating more sources of experimental data
prior to model simulation.
Chapter 19


Clustering 

G.J. McLachlan, R.W. Bean, and S.K. Ng 

Abstract 

Clustering techniques are used to arrange genes in some natural way, that is, to organize genes into groups 
or clusters with similar behavior across relevant tissue samples (or cell lines). These techniques can also be 
applied to tissues rather than genes. Methods such as hierarchical agglomerative clustering, k-means
clustering, the self-organizing map, and model-based methods have been used. Here we focus on mixtures 
of normals to provide a model-based clustering of tissue samples (gene signatures) and of gene profiles, 
including time-course gene expression data. 

Key words Clustering of tissue samples, Clustering of gene profiles, Hierarchical agglomerative 
methods, Partitional methods, k-means, Model-based methods, Normal mixture models, Mixtures
of factor analyzers, Mixtures of linear mixed-effects models, Time-course data, Autoregressive random 
effects 


1 Introduction 


DNA microarray technology, first described in the mid-1990s, is a 
method to perform experiments on thousands of gene fragments in 
parallel. Its widespread use has led to a huge growth in the amount 
of expression data available. A variety of multivariate analysis meth¬ 
ods has been used to explore these data for relationships among the 
genes and the tissue samples. Cluster analysis has been one of the 
most frequently used methods for these purposes. It has demon¬ 
strated its utility in the elucidation of unknown gene function, the 
validation of gene discoveries, and the interpretation of biological 
processes; see [1, 2] for examples.

The main goal of microarray analysis of many diseases, in 
particular of unclassified cancer, is to identify as yet unclassified 
cancer subtypes for subsequent validation and prediction, and ulti¬ 
mately to develop individualized prognosis and therapy. Limiting 
factors include the difficulties of tissue acquisition and the expense 
of microarray experiments. Thus, microarray studies often attempt
to perform a cluster analysis of a small number of tumor samples on
the basis of a large number of genes, and can result in
gene-to-sample ratios of approximately 100-fold.

Many researchers have explored the use of clustering techni¬ 
ques to arrange genes in some natural order, that is, to organize 
genes into clusters with similar behavior across relevant tissue sam¬ 
ples (or cell lines). Although a cluster does not automatically corre¬ 
spond to a pathway, it is a reasonable approximation that genes in 
the same cluster have something to do with each other or are 
directly involved in the same pathway. 

It can be seen there are two distinct but related clustering 
problems with microarray data. One problem concerns the cluster¬ 
ing of the tissues on the basis of the genes; the other concerns the 
clustering of the genes on the basis of the tissues. This duality in 
cluster analysis is quite common. In the present context of microarray
data, one may be interested in grouping tissues (patients) with
similar expression values or in grouping genes on patients with 
similar types of tumors or similar survival rates. 

One of the difficulties of clustering is that the notion of a 
cluster is vague. A useful way to think about the different clustering 
procedures is in terms of the shape of the clusters produced [3]. 
The majority of the existing clustering methods assume that a 
similarity measure or metric is known a priori; often the Euclidean 
metric is used. But clearly, it would be more appropriate to use a 
metric that depends on the shape of the clusters. As pointed out by 
[4], the difficulty is that the shape of the clusters is not known until 
the clusters have been found, and the clusters cannot be effectively 
identified unless the shapes are known. 

Before we proceed to consider the clustering of microarray 
data, we give a brief account of clustering in a general context. 
For a more detailed account of cluster analysis, the reader is referred 
to the many books that either consider or are devoted exclusively to 
this topic; for example, [5-9] and [10, Chapter 7]. A recent review 
article on clustering is [11]. 

1.1 Brief Review of Some Clustering Methods

Cluster analysis is concerned with grouping a number ($n$) of entities
into a smaller number ($g$) of groups on the basis of observations
measured on some variables associated with each entity. We let
$\mathbf{y}_j = (y_{1j}, \ldots, y_{pj})^{T}$ be the observation or feature vector containing
the values of $p$ measurements $y_{1j}, \ldots, y_{pj}$ made on the jth entity
($j = 1, \ldots, n$) to be clustered. These data can be organized as a
matrix,

$$Y_{p \times n} = ((y_{vj})); \quad (1)$$

that is, the jth column of $Y_{p \times n}$ is the observation vector $\mathbf{y}_j$.

In discriminant analysis (supervised learning), the data are
classified with respect to $g$ known classes and the intent is to form
a classifier or prediction rule on the basis of these classified data for
assigning an unclassified entity to one of the $g$ classes on the basis of
its feature vector. In contrast to discriminant analysis, in cluster 
analysis (unsupervised learning) there is no prior information on 
the group structure of the data or, in the case where it is known that 
the population consists of a number of classes, there are no data of 
known origin with respect to the classes. The clustering problem 
falls into two main categories which overlap to some extent [12]: 

1. What is the best way of dividing the entities into a given number 
of groups, where there is no implication that the resulting 
groups are in any sense a natural division of the data. This is 
sometimes called dissection or segmentation. 

2. What is the best way to find a natural subdivision of the entities 
into groups. Here by natural clusters, it is meant that the clusters 
can be described as continuous regions of the feature space 
containing a relatively high density of points, separated from 
other such regions by regions containing a relatively low density 
of points [5]. It is therefore intended that natural clusters pos¬ 
sess the two intuitive qualities of internal cohesion and external 
isolation [13]. 

Sometimes the distinction between the search for naturally 
occurring clusters as in (2) and other groupings as in (1) is stressed; 
see, for example, [14]. But often it is not made, particularly as most 
methods for finding natural clusters are also useful for segmenting 
the data. Essentially, all methods of cluster analysis attempt to 
imitate what the eye and brain do so well in p = 2 dimensions. 
For example, in the scatter plot (Fig. 1) of the expression values of 
two smooth muscle related genes on ten tumors and ten normal 
tissues from the colon cancer data of [15], it is very easy to detect 
the presence of two clusters of equal size without making the 
meaning of the term “cluster” explicit. 

Fig. 1 Scatter plot of the expression values of two genes on ten colon cancer tumors (×) and ten normal
tissues (open circle)

Clustering methods can be categorized broadly as being hierarchical
or nonhierarchical. With a method in the former category,
every cluster obtained at any stage is a merger or split of clusters
obtained at the previous stage. Hierarchical methods can be implemented
in a so-called agglomerative manner (bottom-up), starting
with $g = n$ clusters, or in a divisive manner (top-down), starting
with the $n$ entities to be clustered as a single cluster. In practice,
divisive methods can be computationally prohibitive unless the
sample size $n$ is very small. For instance, there are $2^{(n-1)} - 1$ ways
of making the first subdivision. Hence hierarchical methods are
usually implemented in an agglomerative manner, as discussed
further in the next section. In [16], a hybrid clustering
method was proposed that combines the strengths of bottom-up
hierarchical clustering with those of top-down clustering. The first
method is good at identifying small clusters, but not large ones; the
strengths are reversed for top-down clustering.
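
To make the cost of divisive clustering concrete, the short Python sketch below evaluates the count of possible first subdivisions, 2^(n-1) - 1, and checks it by brute-force enumeration for small n. The function names are illustrative only.

```python
def n_first_splits(n):
    """Number of ways to split n entities into two nonempty groups."""
    return 2 ** (n - 1) - 1

def count_by_enumeration(n):
    """Count the same splits by listing subsets as bitmasks (small n only)."""
    count = 0
    for mask in range(1, 2 ** n - 1):    # nonempty proper subsets
        if mask & 1:                     # count only the group holding entity 0,
            count += 1                   # so each split is counted once
    return count

for n in (4, 10, 20, 50):
    print(n, n_first_splits(n))
```

Even at n = 50 the first split alone has more than 10^14 candidates, which is why exhaustive divisive schemes are restricted to very small samples.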

One of the most popular nonhierarchical methods of clustering
is k-means, where “k” refers to the number of clusters to be
imposed on the data. It seeks to find $k = g$ clusters that minimize
the sum of the squared Euclidean distances between each observation
$\mathbf{y}_j$ and its respective cluster mean; that is, it seeks to minimize
the trace of $W$, tr $W$, where

$$W = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} (\mathbf{y}_j - \bar{\mathbf{y}}_i)(\mathbf{y}_j - \bar{\mathbf{y}}_i)^{T} \quad (2)$$

is the pooled within-cluster sums of squares and products matrix,
and

$$\bar{\mathbf{y}}_i = \sum_{j=1}^{n} z_{ij} \mathbf{y}_j \Big/ \sum_{j=1}^{n} z_{ij} \quad (3)$$

is the sample mean of the ith cluster. Here $z_{ij}$ is a zero-one indicator
variable that is one or zero, according as $\mathbf{y}_j$ belongs or does not
belong to the ith cluster ($i = 1, \ldots, g$; $j = 1, \ldots, n$). It is impossible
to consider all partitions of the $n$ observations into $g$ clusters unless
$n$ is very small, since the number of such partitions with nonempty
clusters is the Stirling number of the second kind,

$$\frac{1}{g!} \sum_{i=0}^{g} (-1)^{g-i} \binom{g}{i} i^{n} \quad (4)$$

which can be approximated by $g^{n}/g!$; see [8]. In practice, k-means
is therefore implemented by iteratively moving points between
clusters so as to minimize tr $W$. In its simplest form, each observation
$\mathbf{y}_j$ is assigned to the cluster with the nearest center (sample
mean) and then the center of the cluster is updated before moving
on to the next observation. Often the centers are estimated initially
by selecting k points at random from the sample to be clustered.
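
A minimal Python sketch of these ideas follows. It uses the batch variant of k-means (all points are reassigned, then all centers are updated), which is an assumption made for brevity and differs from the one-observation-at-a-time update described above. The function names and the six-point toy data set are illustrative. The helper `stirling2` evaluates the Stirling number of the second kind to show how quickly exhaustive search becomes infeasible.

```python
import random
from math import comb, factorial

def stirling2(n, g):
    """Number of partitions of n observations into g nonempty clusters."""
    return sum((-1) ** (g - i) * comb(g, i) * i ** n for i in range(g + 1)) // factorial(g)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(data, k, n_iter=100, seed=0):
    """Batch k-means; returns labels, centers, and tr W."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)        # initial centers: k random observations
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(y, centers[i])) for y in data]
        if new_labels == labels:         # assignments stable: converged
            break
        labels = new_labels
        for i in range(k):
            members = [y for y, lab in zip(data, labels) if lab == i]
            if members:                  # leave an empty cluster's center unchanged
                centers[i] = mean(members)
    tr_w = sum(sq_dist(y, centers[lab]) for y, lab in zip(data, labels))
    return labels, centers, tr_w

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, tr_w = k_means(data, k=2)
print(labels, tr_w)
print(stirling2(20, 2), stirling2(20, 4))
```

The returned `tr_w` is the sum of squared distances of the points to their cluster means, which is the quantity tr W that k-means seeks to minimize.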

Other partitioning methods have been developed, including
k-medoids [8], which is similar to k-means, but constrains each
cluster center to be one of the observations $\mathbf{y}_j$. The self-organizing
map [17] is similar to k-means, but the cluster centers are constrained
to lie on a (two-dimensional) lattice. It is well known that
k-means tends to lead to spherical clusters since it is predicated on
normal clusters with (equal) spherical covariance matrices. One way
to achieve elliptical clusters is to seek clusters that minimize the
determinant of $W$, $|W|$, rather than its trace, as in [18]; see also [19]
who derived this criterion under certain assumptions of normality
for the clusters.

In the absence of any prior knowledge of the metric, it is
reasonable to adopt a clustering procedure that is invariant under
affine transformations of the data; that is, invariant under
transformations of the data of the form,

$$\mathbf{y} \rightarrow C\mathbf{y} + \mathbf{a} \quad (5)$$

where $C$ is a nonsingular matrix. If the clustering of a procedure is
invariant under (5) for only diagonal $C$, then it is invariant under
change of measuring units but not rotations. But as commented
upon in [20], this form of invariance is more compelling than affine
invariance. The clustering produced by minimization of $|W|$ is
affine invariant.

In the statistical and pattern recognition literature in recent
times, attention has been focussed on model-based clustering via
mixtures of normal densities. With this approach, each observation
vector $\mathbf{y}_j$ is assumed to have a g-component normal mixture density,

$$f(\mathbf{y}_j; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(\mathbf{y}_j; \mu_i, \Sigma_i) \quad (6)$$

where $\phi(\mathbf{y}; \mu_i, \Sigma_i)$ denotes the p-variate normal density function
with mean $\mu_i$ and covariance matrix $\Sigma_i$, and the $\pi_i$ denote the
mixing proportions, which are nonnegative and sum to one. Here
the vector $\Psi$ of unknown parameters consists of the mixing proportions
$\pi_i$, the elements of the component means $\mu_i$, and the distinct
elements of the component-covariance matrices $\Sigma_i$, and it can be
estimated by its maximum likelihood estimate $\hat{\Psi}$ calculated via the
EM algorithm; see [21, 22]. This approach gives a probabilistic
clustering defined in terms of the estimated posterior probabilities
of component membership $\tau_i(\mathbf{y}_j; \hat{\Psi})$, where $\tau_i(\mathbf{y}_j; \Psi)$ denotes the
posterior probability that the jth feature vector with observed value
$\mathbf{y}_j$ belongs to the ith component of the mixture ($i = 1, \ldots, g$; $j = 1,
\ldots, n$). Using Bayes’ theorem, it can be expressed as

$$\tau_i(\mathbf{y}_j; \Psi) = \frac{\pi_i \, \phi(\mathbf{y}_j; \mu_i, \Sigma_i)}{\sum_{h=1}^{g} \pi_h \, \phi(\mathbf{y}_j; \mu_h, \Sigma_h)} \quad (7)$$

It can be seen that with this approach, we can have a “soft”
clustering, whereby each observation may partly belong to more
than one cluster. An outright clustering can be obtained by assigning
$\mathbf{y}_j$ to the component to which it has the greatest estimated
posterior probability of belonging. The number of components $g$
in the normal mixture model (Eq. 6) has to be specified in advance
(see Note 1).
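
The soft and outright assignments described above can be illustrated with a short Python sketch. For brevity it treats the univariate case (p = 1) with known parameters rather than estimates obtained by the EM algorithm; the function names and the two-component parameter values are assumptions for the example.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density (the p = 1 case of the component density)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_probs(y, props, means, variances):
    """Posterior probabilities of component membership via Bayes' theorem."""
    weighted = [pi * normal_pdf(y, m, v) for pi, m, v in zip(props, means, variances)]
    total = sum(weighted)                # mixture density at y
    return [w / total for w in weighted]

def hard_assign(y, props, means, variances):
    """Outright clustering: component with the greatest posterior probability."""
    tau = posterior_probs(y, props, means, variances)
    return max(range(len(tau)), key=lambda i: tau[i])

# Hypothetical two-component mixture with equal proportions.
props, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for y in (-2.0, 0.0, 0.3, 2.0):
    print(y, posterior_probs(y, props, means, variances), hard_assign(y, props, means, variances))
```

An observation midway between the two component means receives posterior probability 0.5 for each component, which is the kind of partial membership that an outright assignment hides.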

As noted in [23], “Clustering methods based on such mixture
models allow estimation and hypothesis testing within the framework
of standard statistical theory.” Previously, Marriott [12, p. 70]
had noted that the mixture likelihood-based approach “is
about the only clustering technique that is entirely satisfactory 
from the mathematical point of view. It assumes a well-defined 
mathematical model, investigates it by well-established statistical 
techniques, and provides a test of significance for the results.” One 
potential drawback with this approach is that normality is assumed 
for the cluster distributions. However, this assumption would 
appear to be reasonable for the clustering of microarray data after 
appropriate normalization. 

One attractive feature of adopting mixture models with elliptically
symmetric components such as the normal or its more robust
version in the form of the t-density [22] is that the implied clustering
is invariant under affine transformations in Eq. (5). Also, in the
case where the components of the mixture correspond to externally
defined subpopulations, the unknown parameter vector can be
estimated consistently by a sequence of roots of the likelihood
equation. Note that this is not the case if a criterion such as
minimizing |W| is used.

In the above, we have focussed exclusively on methods that are
applicable for the clustering of the observations and the variables
considered separately; that is, in the context of clustering microarray
data, methods that would be suitable for clustering the tissue
samples and the genes considered separately rather than simultaneously.
Pollard and van der Laan [24] proposed a statistical framework
for two-way clustering; see also [25] and the references therein
for earlier approaches to this problem. More recently, [26] reported
some results on two-way clustering (biclustering) of tissues and
genes. In their work, they obtained similar results to those obtained
when the tissues and the genes were clustered separately.

2 Methods

Although biological experiments vary considerably in their design, 
the data generated by microarray experiments can be viewed as a 
matrix of expression levels. For M microarray experiments
(corresponding to M tissue samples), where we measure the expression
levels of N genes in each experiment, the results can be
represented by an N × M matrix. For each tissue, we can consider
the expression levels of the N genes, called its expression signature. 
Conversely, for each gene, we can consider its expression levels 
across the different tissue samples, called its expression profile. The 
M tissue samples might correspond to each of M different patients 
or, say, to samples from a single patient taken at M different time 
points. The N × M matrix is portrayed in Fig. 2, where each sample
represents a separate microarray experiment and generates a set of
N expression levels, one for each gene.

Fig. 2 Gene expression data from M microarray experiments represented as a
matrix of expression levels with the N rows corresponding to the N genes and the
M columns to the M tissue samples (each column, Sample 1 to Sample M, is an
expression signature; each row, Gene 1 to Gene N, is an expression profile)

Against the above background of clustering methods in a general
context as given in the previous section, we now consider their
application to microarray data, concentrating on a model-based
approach using normal mixtures. But firstly, we consider the application
of hierarchical agglomerative methods, given their extensive
use for this purpose in bioinformatics.

2.1 Clustering of Tissues: Hierarchical Methods

For the clustering of the tissue samples, the microarray data portrayed
in Fig. 2 are in the form of the matrix in Eq. (1) with n = M
and p = N, and the observation vector $y_j$ corresponds to the
expression signature for the jth tissue sample. In statistics, it is
usual to refer to the entirety of the tissue samples as the sample, 
whereas the biologists tend to refer to each individual expression 
signature as a sample; we follow the latter practice here. 

The commonly used hierarchical agglomerative methods can
be applied directly to this matrix to cluster the tissue samples, since
they can be implemented by consideration of the matrix of proximities,
or equivalently, the distances, between each pair of observations.
Thus they require only $O(n^2)$ or at worst $O(n^3)$ calculations,
where n = M and the number M of tissue samples is limited usually
to being less than 100. The situation would be different with the
clustering of the genes as then n = N and the number N of genes
could be in the tens of thousands.

In order to compute the pairwise distances between the observations,
one needs to select an appropriate distance metric. Metrics
that are used include Euclidean distance and the Pearson correlation
coefficient, although the latter is equivalent to the former if the
observations have been normalized beforehand to have zero means
and unit variances. Having selected a distance measure for the
observations, there is a need to specify a linkage metric between
clusters. Some commonly used metrics include single linkage, complete
linkage, average linkage, and centroid linkage. With single
linkage, the distance between two clusters is defined by the distance
between the two nearest observations (one from each cluster),
while with complete linkage, the cluster distance is defined in
terms of the distance between the two most distant observations
(one from each cluster). Average linkage is defined in terms of the
average of the $n_1 n_2$ distances between all possible pairs of observations
(one from each cluster), where $n_1$ and $n_2$ denote the number
of observations in the two clusters in question. For centroid linkage,
the distance between two clusters is the distance between the
cluster centroids (sample means). Another commonly used method
is Ward’s procedure [27], which joins clusters so as to minimize the
within-cluster variance (the trace of W). Lance and Williams [28]
have presented a simple linear system of equations as a unifying
framework for these different linkage measures. Eisen et al. [2]
were the first to apply cluster analysis to microarray data, using
average linkage with a correlation-based metric. The nested clusters
produced by a hierarchical method of clustering can be portrayed
in a tree diagram, in which the extremities (usually shown at the
bottom) represent the individual observations, and the branching
of the tree gives the order of joining together. The height at which
clusters of points are joined corresponds to the distance between
the clusters. However, it is not clear in general how to choose the
number of clusters.
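
The four linkage definitions given above can be written out directly. The sketch below is illustrative only: it assumes Euclidean distance between observations, the function names are hypothetical, and the two small clusters are made-up data rather than microarray measurements.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two observations given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Sample mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(pt[d] for pt in cluster) / n for d in range(len(cluster[0])))


def linkage_distance(c1, c2, method="single"):
    """Distance between two clusters under the named linkage."""
    pair_dists = [euclidean(a, b) for a, b in product(c1, c2)]  # all n1 * n2 pairs
    if method == "single":
        return min(pair_dists)
    if method == "complete":
        return max(pair_dists)
    if method == "average":
        return sum(pair_dists) / len(pair_dists)
    if method == "centroid":
        return euclidean(centroid(c1), centroid(c2))
    raise ValueError(f"unknown linkage method: {method}")


# Made-up clusters on a line, for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]
distances = {m: linkage_distance(cluster_a, cluster_b, m)
             for m in ("single", "complete", "average", "centroid")}
```

An agglomerative procedure repeatedly merges the two clusters with the smallest such distance; the choice of linkage changes which pair that is.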

To illustrate hierarchical agglomerative clustering, we use 
nested polygons in Fig. 3 to show the clusters obtained by applying 
it to six bivariate points, using single linkage with Euclidean distance
as the distance measure. It can be seen that the cluster of
observations 5 and 6 is considerably closer to the cluster of 1 and 
2 than observation 4 is. 



Fig. 3 An illustrative example of hierarchical agglomerative clustering

There is no reason why the clusters should be hierarchical for
microarray data. It is true that if there is a clear, unequivocal
grouping, with little or no overlap between the groups, any method
will reach this grouping. But as pointed out by [12], “hierarchical
methods are not primarily adapted to finding groups.” For
instance, if the division into g = 2 groups given by some hierarchical
method is optimum with respect to some criterion, then the
subsequent division into g = 3 groups is unlikely to be so. This is
due to the restriction that one of the groups must be the same in
both the g = 2 and g = 3 clusterings. As explained by [12], this
restriction is not a natural one to impose if the purpose is to find a
natural grouping of the data. In the sequel, we therefore focus on
nonhierarchical methods of clustering. As advocated by [12, p. 67],
“it is better to consider the clustering problem ab initio,
without imposing any conditions.”

2.2 Clustering of Tissues: Normal Mixtures

More recently, increasing attention is being given to model-based 
methods of clustering of microarray data [29-32]. However, the 
normal mixture model (Eq. 6) cannot be directly fitted to the tissue
samples if the number of genes p used in the expression signature is
large. This is because the component-covariance matrices $\Sigma_i$ are
highly parameterized with $\frac{1}{2}p(p+1)$ distinct elements each. A simple
way of proceeding in the clustering of high-dimensional data would
be to take the component-covariance matrices $\Sigma_i$ to be diagonal.
But this leads to clusters whose axes are aligned with those of the
feature space, whereas in practice the clusters are of arbitrary orientation.
For instance, taking the $\Sigma_i$ to be a common multiple of the
identity matrix leads to a soft version of k-means, which produces
spherical clusters.
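
To make the scale of the problem concrete, the parameter counts can be tallied directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and p = 2000 with g = 2 is chosen only as an example of a typical microarray dimension.

```python
def covariance_params(p, structure="full"):
    """Number of distinct elements in one p x p component-covariance matrix."""
    if structure == "full":
        return p * (p + 1) // 2  # symmetric matrix: diagonal plus upper triangle
    if structure == "diagonal":
        return p
    raise ValueError(f"unknown structure: {structure}")


def mixture_params(g, p, structure="full"):
    """Free parameters in a g-component normal mixture: proportions, means, covariances."""
    return (g - 1) + g * p + g * covariance_params(p, structure)


# Example values assumed for illustration: 2000 genes, two components.
full_count = mixture_params(2, 2000, "full")
diag_count = mixture_params(2, 2000, "diagonal")
```

With only tens of tissue samples available, a full-covariance model with millions of parameters cannot be estimated reliably, which motivates the restrictions discussed next.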

Banfield and Raftery [33] introduced a parameterization of the
component-covariance matrix $\Sigma_i$ based on a variant of the standard
spectral decomposition of $\Sigma_i$ ($i = 1, \ldots, g$). But if p is large relative
to the sample size n, it may not be possible to use this decomposition
to infer an appropriate model for the component-covariance
matrices. Even if it were possible, the results may not be reliable due 
to potential problems with near-singular estimates of the 
component-covariance matrices when p is large relative to n. 

Hence, in fitting normal mixture models with unrestricted 
component-covariance matrices to high-dimensional data, we 
need to consider first some form of dimension reduction and/or 
some form of regularization. A common approach to reducing the 
number of dimensions is to perform a principal component analysis 
(PCA). However, the latter provides only a global linear model for 
the representation of the data in a lower-dimensional subspace. 
Thus it has limited scope in revealing group structure in a data 
set. A global nonlinear approach can be obtained by postulating a 
finite mixture of linear (factor) submodels for the distribution of 
the full observation vector $y_j$ given a relatively small number of
(unobservable) factors. That is, we can provide a local dimensionality
reduction method by a mixture of factor analyzers model,
which is given by Eq. (6) by imposing on the component-covariance
matrix $\Sigma_i$ the constraint

$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g), \qquad (8)$$

where $B_i$ is a p × q matrix of factor loadings and $D_i$ is a diagonal
matrix ($i = 1, \ldots, g$). We can think of the use of this mixture of
factor analyzers model as being purely a method of regularization.
But in the present context, it might be possible to make a case for it
being a reasonable model for the correlation structure between the
genes. This model implies that the latter can be explained by the
linear dependence of the genes on a small number of latent (unobservable)
variables specific to each component.
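
The constraint in Eq. (8) can be illustrated by assembling one component-covariance matrix from its factor loadings and diagonal part. This is a minimal sketch with made-up numbers (p = 3, q = 1); the function name and values are assumptions for illustration and do not reproduce the EMMIX-GENE implementation.

```python
def factor_covariance(B, d):
    """Return Sigma = B B^T + D for loadings B (p x q, list of rows) and diagonal d (length p)."""
    p, q = len(B), len(B[0])
    sigma = [[sum(B[r][k] * B[c][k] for k in range(q)) for c in range(p)]
             for r in range(p)]
    for r in range(p):
        sigma[r][r] += d[r]  # add the diagonal matrix D
    return sigma


# Made-up loadings and diagonal variances, for illustration only.
B = [[1.0], [2.0], [0.5]]
d = [0.1, 0.2, 0.3]
sigma = factor_covariance(B, d)

# Raw parameter count of the factor-analytic form (before identifiability constraints)
# versus the unrestricted form, for p = 3 and q = 1.
p, q = len(B), len(B[0])
raw_factor_count = p * q + p
full_count = p * (p + 1) // 2
```

For small p the saving is nil, but the count grows linearly in p for fixed q, instead of quadratically as in the unrestricted case, which is the point of the regularization.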

The EMMIX-GENE program of [31] has been designed for 
the clustering of tissue samples via mixtures of factor analyzers. In 
practice we may wish to work with a subset of the available genes, 
particularly as the fitting of a mixture of factor analyzers will involve 
a considerable amount of computation time for an extremely large 
number of genes. Indeed, the simultaneous use of too many genes 
in the cluster analysis may serve only to create noise that masks the 
effect of a smaller number of genes. Also, the intent of the cluster 
analysis may not be to produce a clustering of the tissues on the 
basis of all the available genes, but rather to discover and study 
different clusterings of the tissues corresponding to different subsets
of the genes [24, 34]. As explained in [35], the tissues (cell
lines or biological samples) may cluster according to cell or tissue
type (for example, cancerous or healthy) or according to cancer
type (for example, breast cancer or melanoma). However, the same
samples may cluster differently according to other cellular characteristics,
such as progression through the cell cycle, drug metabolism,
mutation, growth rate, or interferon response, all of which
have a genetic basis. 

Therefore, the EMMIX-GENE procedure has two optional 
steps before the final step of clustering the tissues. The first step 
considers the selection of a subset of relevant genes from the 
available set of genes by screening the genes on an individual basis 
to eliminate those which are of little use in clustering the tissue 
samples. The usefulness of a given gene to the clustering process 
can be assessed formally by a test of the null hypothesis that it has a 
single component normal distribution over the tissue samples (see 
Note 2). Even after this step has been completed, there may still be 
too many genes remaining. Thus there is a second step in EMMIX-GENE
in which the retained gene profiles are clustered (after
standardization) into a number of groups on the basis of Euclidean
distance so that genes with similar profiles are put into the same
group. In general, care has to be taken with the scaling of variables
before clustering of the observations, as the nature of the variables
can be intrinsically different. In the present context the variables 
(gene expressions) are measured on the same scale. Also, as noted 
above, the clustering of the observations (tissues) via normal mixture
models is invariant under changes in scale and location. The
clustering of the tissue samples can be carried out on the basis of the 
groups considered individually using some or all of the genes within 
a group or collectively. For the latter, we can replace each group by 
a representative (a metagene) such as the sample mean as in the 
EMMIX-GENE procedure. 
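
The idea of replacing a group of genes by a metagene can be sketched as follows. This is an illustrative outline only, assuming each gene profile is standardized to zero mean and unit variance across tissues and that the metagene is the plain sample mean of the profiles in a group; the function names and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(profile):
    """Scale one gene profile to zero mean and unit variance across the tissues.

    Assumes the profile is not constant (otherwise the standard deviation is zero).
    """
    m, s = mean(profile), pstdev(profile)
    return [(x - m) / s for x in profile]


def metagene(group):
    """Representative of a group of gene profiles: the tissue-wise sample mean."""
    return [mean(column) for column in zip(*group)]


# Made-up expression levels for two genes over three tissues.
group = [standardize([1.0, 2.0, 3.0]), standardize([10.0, 20.0, 30.0])]
representative = metagene(group)
```

Because both made-up genes have the same shape after standardization, the metagene coincides with either standardized profile; the tissues can then be clustered on the metagenes rather than on every gene.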

To illustrate this approach, we applied the EMMIX-GENE 
procedure to the colon cancer data of [15]. It consists of 
n = 2000 genes and p = 62 columns denoting 40 tumors and 22
normal tissues. After applying the selection step to this set, there
were 446 genes remaining in the set. The remaining genes were
then clustered into 20 groups, which were ranked on the basis of
$-2\log\lambda$, where $\lambda$ is the likelihood ratio statistic for testing g = 1
versus g = 2 components in the mixture model. The heat map of
the second ranked group $G_2$ is shown in Fig. 4. The clustering of
the tissues on the basis of the 24 genes in $G_2$ resulted in a partition
of the tissues in which one cluster contains 37 tumors (1-29,
31-32, 34-35, 37-40) and 3 normals (48, 58, 60), and the other
cluster contains 3 tumors (30, 33, 36) and 19 normals (41-47,
49-57, 59, 61-62). This corresponds to an error rate of 6 out of 62
tissues compared to the “true” classification given in [15]. (This is
why here we examine the heat map of $G_2$ instead of $G_1$.) For further
details about the results of the tissue clustering procedure on this
data set, see [31].

Fig. 4 Heat map of 24 genes in group $G_2$ on 40 tumor and 22 normal tissues in
Alon data
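
The quantity used above to rank the gene groups, and the BIC comparison described in Note 1, are both simple functions of two maximized log likelihoods. The sketch below is illustrative: the function names are hypothetical, and the log likelihood values, d, and n are made-up numbers rather than results from the colon data.

```python
import math


def minus_two_log_lambda(loglik_g0, loglik_g1):
    """-2 log(lambda) for g0 versus g1 components, from the two maximized log likelihoods."""
    return 2.0 * (loglik_g1 - loglik_g0)


def bic_prefers_larger_model(loglik_g0, loglik_g1, d, n):
    """BIC rule: choose g1 over g0 if -2 log(lambda) exceeds d * log(n).

    d is the difference in the number of parameters, n the number of observations.
    """
    return minus_two_log_lambda(loglik_g0, loglik_g1) > d * math.log(n)


# Made-up log likelihoods; d = 3 corresponds to a univariate g = 1 versus g = 2 comparison
# (one extra proportion, mean, and variance), and n = 62 is assumed for illustration.
strong = bic_prefers_larger_model(-120.0, -105.0, d=3, n=62)
weak = bic_prefers_larger_model(-120.0, -118.0, d=3, n=62)
```

A large gain in log likelihood favors the two-component model, whereas a small gain does not outweigh the penalty for the extra parameters.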

2.3 Clustering of Gene Profiles

In order to cluster gene profiles, it might seem possible just to
interchange rows and columns in the data matrix in Eq. (1). But
with most applications of cluster analysis in practice it is assumed
that

(a) there are no replications on any particular entity specifically
identified as such;

(b) all the observations on the entities are independent of one
another.

These assumptions should hold for the clustering of the tissue
samples, although the tissue samples have been known to be correlated
for different tissues due to flawed experimental conditions.
However, condition (b) will not hold for the clustering of gene
profiles, since not all the genes are independently distributed, and
condition (a) will generally not hold either as the gene profiles may
be measured over time or on technical replicates. While this correlated
structure can be incorporated into the normal mixture model
in Eq. (6) by appropriate specification of the component-covariance
matrices $\Sigma_i$, it is difficult to fit the model under such specifications.
For example, the M-step (the maximization step of the EM algorithm)
may not exist in closed form. Accordingly, we now consider
the EMMIX-WIRE model of Ng et al. [36], who adopt conditionally
a mixture of linear mixed models to specify this correlation
structure among the tissue samples and to allow for correlations
among the genes. It also enables covariate information to be
incorporated into the clustering process.

For a gene microarray experiment with repeated measurements,
we have for the jth gene ($j = 1, \ldots, n$), where $n = N$, a
feature vector (profile vector) $y_j = (y_{1j}^T, \ldots, y_{tj}^T)^T$, where $t$ is the
number of distinct tissues in the experiment and

$$y_{lj} = (y_{l1j}, \ldots, y_{lrj})^T \quad (l = 1, \ldots, t)$$

contains the r replications on the jth gene from the lth tissue. Note
that here, the r replications can also be time points. The dimension
p of the profile vector $y_j$ is equal to the number of microarray
experiments, p = rt. Conditional on its membership of the ith
component of the mixture, the EMMIX-WIRE procedure assumes
that $y_j$ follows a linear mixed-effects model (LMM),

$$y_j = X\beta_i + U b_{ij} + V c_i + \varepsilon_{ij}, \qquad (9)$$

where the elements of $\beta_i$ (an m-dimensional vector) are fixed effects
(unknown constants) ($i = 1, \ldots, g$). In Eq. (9), $b_{ij}$ (a $q_b$-dimensional
vector) and $c_i$ (a $q_c$-dimensional vector) represent the unobservable
gene- and tissue-specific random effects, respectively, conditional
on membership of the ith cluster. These random effects represent
the variation due to the heterogeneity of genes and tissues
(corresponding to $b_i = (b_{i1}^T, \ldots, b_{in}^T)^T$ and $c_i$, respectively). The
random effects $b_{ij}$ and $c_i$ and the measurement error vector $\varepsilon_{ij}$ are
assumed to be mutually independent. In Eq. (9), X, U, and V are
known design matrices of the corresponding fixed or random
effects. The dimensions $q_b$ and $q_c$ of the random effects terms $b_{ij}$
and $c_i$ are determined by the design matrices U and V which, along
with X and H, specify the experimental design to be adopted.

With the LMM, the distributions of $b_{ij}$ and $c_i$ are taken, respectively,
to be multivariate normal $N_{q_b}(0, \theta_{bi} I_{q_b})$ and $N_{q_c}(0, \theta_{ci} I_{q_c})$,
where $I_{q_b}$ and $I_{q_c}$ are identity matrices with dimensions being
specified by the subscripts. The measurement error vector $\varepsilon_{ij}$ is
also taken to be multivariate normal $N_p(0, A_i)$, where
$A_i = \mathrm{diag}(H\phi_i)$ is a diagonal matrix constructed from the vector
$H\phi_i$ with $\phi_i = (\sigma_{i1}^2, \ldots, \sigma_{iq_e}^2)^T$ and H is a known $p \times q_e$ zero-one
design matrix. That is, we allow the ith component-variance to be
different among the p microarray experiments.

The vector of unknown parameters can be estimated by
maximum likelihood via the EM algorithm, proceeding conditionally
on the tissue-specific random effects $c_i$. The E- and M-steps can
be implemented in closed form. In particular, an approximation to
the E-step by carrying out time-consuming Monte Carlo methods
is not required. A probabilistic or an outright clustering of the
genes into $g$ components can be obtained, based on the estimated
posterior probabilities of component membership given the profile
vectors and the estimated tissue-specific random effects
$c_i$ ($i = 1, \ldots, g$).

To illustrate this method, we report here an example from [37] 
who extended the EMMIX-WIRE model to incorporate first-order 
autoregressive AR(1) random effects for clustering some time- 
course data from the yeast cell-cycle study of Cho et al. [38]. The 
data consist of the expression levels of 237 genes over two cycles for
the yeast cells at p = 17 time points, sampling at 10-min intervals,
where the raw data were log transformed and normalized by columns
and rows. A general form of the first-order Fourier series
expansion is adopted to model periodic gene expression [39]. With
reference to Eq. (9), the design matrix X was taken to be a 17 × 2
matrix (m = 2) with the (l + 1)th row (l = 0, ..., 16)

$$\left(\cos(2\pi(10l)/\omega + \Phi) \quad \sin(2\pi(10l)/\omega + \Phi)\right), \qquad (10)$$

where the period of the cell cycle $\omega$ was taken to be 85 and the
phase offset $\Phi$ was set to zero. The design matrices for the random
effects parts were specified as $U = I_{17}$ and $V = I_{17}$. That is, it is
assumed that there exist random gene effects $b_{ij}$ with $q_b = 17$ and
random temporal effects $c_i = (c_{i1}, \ldots, c_{i17})^T$ with $q_c = p = 17$. The
latter introduce dependence among expression levels within the
same cluster obtained at the same time point. Also, $H = 1_{17}$ (a 17 × 1
vector of ones) and $\phi_i = \sigma_i^2$ ($q_e = 1$) so that the component variances
are common among the p = 17 experiments. To account for the
time-dependent random gene effects, an AR(1) correlation structure
is adopted for the gene profiles, so that $b_{ij}$ follows a
$N(0, \theta_i A(\rho_i))$ distribution, where


$$A(\rho_i) = \begin{pmatrix} 1 & \rho_i & \cdots & \rho_i^{16} \\ \rho_i & 1 & \cdots & \rho_i^{15} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_i^{16} & \rho_i^{15} & \cdots & 1 \end{pmatrix}. \qquad (11)$$

The inverse of $A(\rho_i)$ can be expressed as

[Eqs. (12) and (13) are not recoverable from this copy of the text; see [37].]

In Eq. (12), I, J, and K are all 17 × 17 matrices, where I is the
identity matrix, J has its sub-diagonal entries ones and zeros elsewhere,
and K takes on the value 1 at the first and last elements of its
principal diagonal and zeros elsewhere; see [37] for detailed
derivation.
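
The fixed-effects design matrix defined by Eq. (10) is easy to tabulate. The following sketch assumes the values stated in the text (17 time points at 10-min intervals, period 85, phase offset zero); the function name and argument names are hypothetical.

```python
import math


def fourier_design_matrix(n_points=17, interval=10.0, period=85.0, phase=0.0):
    """Rows (cos, sin) of the first-order Fourier design matrix, one row per time point."""
    rows = []
    for l in range(n_points):  # l = 0, ..., n_points - 1
        angle = 2.0 * math.pi * (interval * l) / period + phase
        rows.append((math.cos(angle), math.sin(angle)))
    return rows


X = fourier_design_matrix()
```

Each row lies on the unit circle, so the two fixed-effect coefficients in a component jointly determine the amplitude and phase of that cluster's periodic mean profile.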


With this data set, the 237 genes were categorized with respect 
to the four categories in the MIPS database (DNA synthesis and 
replication, organization of centrosome, nitrogen and sulfur 
metabolism, and ribosomal proteins); see [40]. The clustering 
results using the extended EMMIX-WIRE model for g = 4 are 
given in Fig. 5, where the expression profiles for genes in each 
cluster are presented. The adjusted Rand index [36], for assessing 
the degree of agreement between the clustering results and the four 
categories of genes, was 0.6189, which is the best match (the 
largest index) compared with several model-based and hierarchical 
clustering algorithms considered in [40]; see [37] for more details.

Fig. 5 Clustering results for Cho’s yeast cell cycle data. For all the plots, the x-axis is the time point and the
y-axis is the gene-expression level

3 Notes

1. For both procedures, as with other partitional clustering methods,
the number of clusters g needs to be specified at the outset.
As both procedures are model-based, we can make a choice as to
an appropriate value of g by consideration of the likelihood
function. In the absence of any prior information as to the
number of clusters present in the data, we monitor the increase
in the log likelihood function as the value of g increases. At
any stage, the choice of $g = g_0$ versus $g = g_1$, for instance $g_1 = g_0 + 1$,
can be made by either performing the likelihood ratio
test or by using some information-based criterion, such as BIC
(Bayesian information criterion). Unfortunately, regularity conditions
do not hold for $-2\log\lambda$, where $\lambda$ is the likelihood ratio
statistic, to have its usual null distribution of chi-squared with
degrees of freedom equal to the difference d in the number of
parameters for $g = g_1$ and $g = g_0$ components in the mixture
models. One way to proceed is to use a resampling approach as
in [41]. Alternatively, one can apply BIC, which leads to the
selection of $g = g_1$ over $g = g_0$ if $-2\log\lambda$ is greater than
$d \log(n)$. The value of d is obvious in applications of
EMMIX-GENE, but is not so clear with applications of
EMMIX-WIRE, due to the presence of random effects terms [36].

2. The most time-consuming step of the three steps is the gene
selection step of EMMIX-GENE. This step is slower than the 
others as a mixture of two normals is fitted for each gene, instead 
of a multivariate normal being fitted to a group of genes or a 
metagene simultaneously. A faster but ad hoc selection step is to 
make the decision for each gene on the basis of the interquartile 
range of the gene expression values over the tissues.

Parameterized Algorithmics for Finding Exact
Solutions of NP-Hard Biological Problems

Falk Hüffner, Christian Komusiewicz, Rolf Niedermeier,
and Sebastian Wernicke 


Abstract 

Fixed-parameter algorithms are designed to efficiently find optimal solutions to some computationally hard 
(NP-hard) problems by identifying and exploiting “small” problem-specific parameters. We survey practical 
techniques to develop such algorithms. Each technique is introduced and supported by case studies of 
applications to biological problems, with additional pointers to experimental results. 

Key words Computational intractability, NP-hard problems, Algorithm design, Exponential running 
times, Discrete problems, Fixed-parameter tractability, Optimal solutions 


1 Introduction 


Many problems that emerge in bioinformatics require vast amounts 
of computer time to be solved optimally. An illustrative example, 
though somewhat oversimplified, would be the following: Given a 
set of n experiments of which some pairs have conflicting results 
(that is, at least one result must be wrong), identify a minimum-size 
subset of experiments to eliminate such that no conflict remains. 
This problem, while simple to describe, has no known algorithm 
that solves it efficiently on all inputs. From a theoretical standpoint, 
such computational hardness can be traced back to the NP-hardness
of a problem. Assuming a widely believed conjecture in complexity 
theory, the classification of a computational problem as NP-hard 
implies that the time needed to solve it grows very quickly (usually 
exponentially) with the input size [61]. However, the demand to 
solve NP-hard problems commonly arises in practical settings, 
including bioinformatics. To obtain solutions to these problems 
despite their NP-hardness, it is common to sacrifice solution quality 
for efficiency, for example, by employing heuristic algorithms or 
approximation algorithms. A different approach is to insist on exact
solutions and accept that the algorithm will not be efficient on all
inputs but hopefully on those that arise in the application at hand. 

Most theory on computational hardness is based on the
assumption that the difficulty of solving an instance of a computational
problem is determined by the size of that instance. The
crucial observation this chapter is based on is that often it is not
the size of an instance that makes a problem computationally hard
to solve, but rather its structure. Parameterized algorithmics renders
this observation precise by quantifying structural hardness with
so-called parameters, each typically a nonnegative integer variable
denoted by k, or a tuple of such variables. A parameterized problem
is then called fixed-parameter tractable (FPT) if it can be solved
efficiently when the parameter is small; the corresponding algorithm
is called a fixed-parameter algorithm. The concept of
fixed-parameter tractability thus formalizes and generalizes the concept
of “tractable special cases” that are known for virtually all NP-hard 
problems. For example, as we will discuss in more detail below, our 
introductory problem can be solved quickly whenever the number 
of conflicting experiments is small (a reasonable assumption in 
practical settings, since the results would otherwise not be worth 
much anyway). 
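
To give a first impression of how a small parameter can be exploited, the introductory conflict problem can be attacked by simple branching. The sketch below is not the chapter's own algorithm; it assumes the classical strategy of picking any remaining conflict and trying each of its two experiments in turn, with the parameter k being the number of experiments we are allowed to eliminate. The function names and the example conflicts are hypothetical.

```python
def resolve_conflicts(conflicts, k):
    """Return a set of at most k experiments whose removal leaves no conflict, or None.

    conflicts is a list of pairs (u, v); at least one of u, v must be removed per pair.
    The recursion has depth at most k and branches two ways, so at most 2**k leaves.
    """
    if not conflicts:
        return set()
    if k == 0:
        return None  # conflicts remain but no removals are left
    u, v = conflicts[0]
    for chosen in (u, v):
        remaining = [pair for pair in conflicts if chosen not in pair]
        partial = resolve_conflicts(remaining, k - 1)
        if partial is not None:
            return partial | {chosen}
    return None


def minimum_removal(conflicts):
    """Smallest set of experiments to eliminate, found by trying k = 0, 1, 2, ..."""
    k = 0
    while True:
        solution = resolve_conflicts(conflicts, k)
        if solution is not None:
            return solution
        k += 1


# Made-up conflicts among six experiments, for illustration only.
conflicts = [(1, 2), (1, 3), (1, 4), (5, 6)]
removed = minimum_removal(conflicts)
```

The running time grows exponentially only in k, not in the number of experiments, which is the behavior that fixed-parameter tractability formalizes.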

Often, there are many possible parameters to choose from. For 
example, for solving our introductory problem we could choose the 
maximum number of conflicts for a single experiment to be the 
parameter or, alternatively, the size of the largest group of pairwise 
conflicting experiments. This makes parameterized algorithmics a 
multipronged attack that can be adapted to different practical 
applications. Of course, not all parameters lead to efficient algorithms;
in fact, parameterized algorithmics also provides tools to
classify parameters as “not helpful” in the sense that we cannot 
expect provably efficient algorithms even when these parameters are 
small. 

Fixed-parameter algorithms have by now facilitated many success
stories and several techniques have emerged as being applicable
to large classes of problems [81]. This chapter presents several of
these techniques, namely kernelization (Subheading 2), depth-bounded
search trees (Subheading 3), dynamic programming
(Subheading 4), tree decompositions of graphs (Subheading 5),
color-coding (Subheading 6), and iterative compression (Subheading 7).
We start each section by introducing the basic concepts and
ideas, followed by some case studies concerning practically relevant 
bioinformatics problems. Concluding each section, we survey 
known applications, implementations, and experimental results, 
thereby highlighting the strengths and fields of applicability for 
each technique. 

Another commonly used strategy for exactly solving NP-hard
problems is to reduce the problem at hand to “general-purpose
problems” such as integer linear programming [7, 8] and
satisfiability solving [14, 105, 127]. For these, there exist highly
optimized tools with years of algorithm engineering effort that
went into their development. Therefore, if an NP-hard problem
can be efficiently expressed as one of these general-purpose problems,
these tools might be able to find an optimum solution
without the need for any further algorithm design. In many application
scenarios, it will actually make sense to try and combine these
general-purpose approaches and the more problem-specific
approach of parameterized algorithmics since the specific advantage
of fixed-parameter algorithms is that they are usually crafted
directly for the problem at hand and thus may allow a better
exploitation of problem-specific features to substantially gain efficiency.
In particular, the polynomial-time data reduction techniques
that are introduced in Subheading 2 usually combine nicely
and productively with the more general solver tools.

Before discussing the main techniques of fixed-parameter algorithms
in the following sections, the remainder of this section
provides a crash course in computational complexity theory and a
few formal definitions related to parameterized complexity analysis.
Furthermore, some terms from graph theory are introduced, and
we present our running example problem vertex cover.

1.1 Computational Complexity Theory

In this survey, we are concerned with efficiently solving computational
problems. A standard format for specifying these problems is
to phrase them in an “Input/Task” way that formally specifies the
input and desired output. A core topic of computational complexity
theory is the evaluation and comparison of different algorithms for
a given problem [106, 113]. Since most algorithms are designed to
work with variable inputs, the efficiency (or complexity) of an algorithm
is not just stated for some concrete inputs (instances), but
rather as a function that relates the input length n to the number of
steps that are required to execute the algorithm. Generally, this
function is given in an asymptotic sense, the standard way being
the big-O notation where we write $f(n) = O(g(n))$ to express that
$f(n)/g(n)$ is upper-bounded by a positive constant in the limit for
large n [45, 86, 122]. Since instances of the same size might take
different amounts of time, it is implicitly assumed in this chapter 
that we are considering the worst-case running time among all 
instances of the same size; that is, we deliberately exclude from 
our analysis the potentially efficient solvability of some specific 
input instances of a computational problem. 

Determining the computational complexity of problems
(meaning the best possible worst-case running time of an algorithm
for them) is a key issue in theoretical computer science. Herein, it is
of central importance to distinguish between problems that can be
solved efficiently and those that presumably cannot. To this end,
theoretical computer science has coined the notions of
polynomial-time solvability, on the one hand, and NP-hardness, on the
other [61]. Here, polynomial-time solvability means that for every
size-n input instance of a problem, an optimal solution can be
computed in $n^{O(1)}$ time. In contrast, the (unproven, yet widely
believed) working hypothesis of theoretical computer science is
that NP-hard problems cannot be solved in $n^{O(1)}$ time. More
specifically, typical running times for NP-hard problems are of the
form $O(c^n)$ for some constant $c > 1$; that is, we have an exponential
growth in the number of computation steps as instances grow
larger. In this sense, polynomial-time solvability has become a
synonym for efficient solvability.

As there are thousands of known NP-hard optimization
problems and their number is continuously growing [106, 114],
several approaches have been developed that try to circumvent the
assumed computational intractability of NP-hard problems. One
such approach is based on polynomial-time approximation
algorithms, where one gives up seeking optimal solutions in order to
have efficient algorithms [9, 128, 131]. Another common strategy
is to use heuristics, where one gives up provable performance
guarantees (concerning running time or solution quality) by
developing algorithms that behave well in “most” practical
applications [104, 106].

1.2 Parameterized Complexity

For many applications, the compromises inherent to approximation
algorithms and heuristics are not satisfactory. Fixed-parameter
algorithms can provide an alternative by providing exact solutions
with useful running time guarantees [53, 59, 109]. The core
concept is formalized as follows:

Definition 1. A parameterized problem instance consists of a problem
instance I and a parameter k. A parameterized problem is
fixed-parameter tractable if it can be solved in $f(k) \cdot |I|^{O(1)}$
time, where f is a (computable) function solely depending on the
parameter k.

For NP-hard problems, f(k) will typically be an exponential
function like $2^k$ rather than a polynomial function.

Note the difference between “fixed-parameter tractable”
and “polynomial-time solvable for fixed k”: an algorithm running
in $O(|I|^k)$ time demonstrates that a problem is polynomial-time
solvable for any fixed k, but does not show fixed-parameter
tractability since the degree of the polynomial depends on k; ideally, a
fixed-parameter algorithm provides a linear-time algorithm for each
fixed k [126].
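
To make this distinction concrete, the following sketch compares the three running-time shapes by plain step counts: $2^n$ (brute force over all subsets), $n^k$ (polynomial for each fixed k, but with k in the exponent), and $2^k \cdot n$ (fixed-parameter tractable). The function names and the sample values n = 1000 and k = 20 are illustrative assumptions, and constants hidden by the O-notation are ignored.

```python
def brute_force_steps(n: int) -> int:
    """Step count of the shape 2^n (try every subset of n items)."""
    return 2 ** n


def xp_steps(n: int, k: int) -> int:
    """Step count of the shape n^k: polynomial for fixed k, degree grows with k."""
    return n ** k


def fpt_steps(n: int, k: int) -> int:
    """Step count of the shape 2^k * n: exponential only in the parameter k."""
    return 2 ** k * n


if __name__ == "__main__":
    n, k = 1000, 20  # illustrative values, not taken from a benchmark
    for name, steps in [
        ("2^n", brute_force_steps(n)),
        ("n^k", xp_steps(n, k)),
        ("2^k * n", fpt_steps(n, k)),
    ]:
        # Report only the number of decimal digits; the raw values are huge.
        print(f"{name:8s} has {len(str(steps))} decimal digits")
```

For these sample values, only the third shape stays within a range that a computer can actually execute.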

As an example for this “parameterized perspective,” consider
again the identification of k faulty experiments among n
experiments. We could naively solve this problem in $O(2^n)$ time by trying
all possible subsets of the n experiments. However, this would not
be practically feasible for n > 40. In contrast, a simple
fixed-parameter algorithm with running time $O(2^k \cdot n)$ exists for this
problem, which allows it to be solved even for n > 1000, as long as
k < 20 (as we will discuss in Subheading 2.4, real-world instances
can often be solved for much larger values of k by an extension of
this approach).

Unfortunately, there are parameterized problems for which
there is good evidence that they are not fixed-parameter tractable
(see Note 1).

Typically, a problem allows for more than one
parameterization [90, 110]. From a theoretical point of view, parameterization
is a key to better understand the nature of computational
intractability. The ultimate goal here is to learn how parameters influence
the computational complexity of problems. The more we know
about these interactions, the more likely it becomes that we can cope
with computational intractability. In a sense, it may be considered
an art to find the most useful parameterizations of a computational
problem.

From an applied point of view, the identification of parameters
for a concrete problem should go hand-in-hand with an extensive
data analysis. One natural way for spotting relevant
parameterizations of a problem in real-world applications is to analyze the given
input data and check which quantifiable aspects of it appear to be
small and might thus be suitable as parameters. For example, if the
input is a network, one such observable parameter could be the
maximum vertex degree. Often, real-world input instances also
carry some hidden structure that might be exploited. Again turning
to graphs, well-known parameters such as “feedback vertex set
number” or “treewidth” measure how tree-like a graph is. These
parameters are motivated by the observation that many intractable
graph problems become tractable when restricted to trees. For
NP-hard string problems, which also occur frequently in
bioinformatics, natural parameters are, for example, the size of the alphabet or
the number of occurrences of a letter [35].

1.3 Graph Theory

Many of the problems we deal with in this work can be formulated in
graph-theoretic terms [48, 129]. An undirected graph G = (V, E)
is given by a set of vertices V and a set of edges E, where each edge
{v, w} is an undirected connection of two vertices v and w.
Throughout this work, we use $n := |V|$ to denote the number of vertices
and $m := |E|$ to denote the number of edges. For a set of
vertices $V' \subseteq V$, the induced subgraph $G[V']$ is the graph
$(V', \{\{v, w\} \in E \mid v, w \in V'\})$, that is, the graph G restricted
to the vertices in V'. We denote the open neighborhood of a vertex v by
$N(v) := \{u \mid \{u, v\} \in E\}$ and its closed neighborhood by
$N[v] := N(v) \cup \{v\}$.

It is not hard to see that we can formalize our introductory
problem of recognizing faulty experiments as a graph problem
where vertices correspond to experiments and edges correspond
to pairs of conflicting experiments. Thus, we need to choose a small
set of vertices (the experiments to eliminate) so that each edge is
incident with at least one chosen vertex. This is known as the
NP-hard Vertex Cover problem, which serves as a running example for
several techniques in this work.

Fig. 1 A graph with a size-8 vertex cover (cover vertices are marked black)

VERTEX COVER

Input: An undirected graph G = (V, E) and a nonnegative integer k.

Task: Find a set $C \subseteq V$ of at most k vertices such that each edge in E has at
least one of its endpoints in C.

The problem is illustrated in Fig. 1. Vertex Cover can well be
considered a poster child of fixed-parameter research, as many
discoveries that influenced the whole field originated from the
study of this single problem.
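
The Input/Task formulation can be made concrete with a small sketch. It assumes a graph given as a list of vertex pairs; the function names and the example conflict graph are hypothetical. The second function is only the naive baseline that tries all subsets of at most k vertices, not a fixed-parameter algorithm.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    """Return True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def find_vertex_cover_brute_force(vertices, edges, k):
    """Try all subsets of at most k vertices; return a cover or None.

    Enumerating subsets of size at most k takes roughly n^k steps, so the
    degree of the polynomial grows with k.
    """
    vertices = list(vertices)
    for size in range(min(k, len(vertices)) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, candidate):
                return set(candidate)
    return None


if __name__ == "__main__":
    # Hypothetical conflict graph: experiments 1..5, edges are conflicts.
    edges = [(1, 2), (1, 3), (1, 4), (4, 5)]
    print(find_vertex_cover_brute_force(range(1, 6), edges, 2))
```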


2 Kernelization: Data Reduction with Guaranteed Effectiveness 

The idea of data reduction is to quickly presolve those parts of a
given problem instance that are easy to cope with, shrinking it to
those parts that form its hard core [73, 94]. Computationally
expensive algorithms need then only be applied to this core. In
some practical scenarios, data reduction may even reduce instances
of a seemingly hard problem to triviality. Once an effective (and
efficient) reduction rule has been found, it is typically not only
useful in the context of parameterized algorithmics, but also in
other problem solving contexts, whether they be heuristic,
approximative, or exact.

This section introduces the concept of kernelization, that is,
polynomial-time data reduction with guaranteed effectiveness.
Kernelization is closely connected to fixed-parameter tractability and
emerges within its framework.

2.1 Basic Concepts

There are many examples of combinatorial problems that would
not be solvable without employing heuristic data reduction and
preprocessing algorithms. For example, commercial solvers for
hard combinatorial problems such as the integer linear program
solver CPLEX heavily rely on data-reducing preprocessors for their
efficiency [15]. Obviously, many practitioners are aware of the
general concept of data reduction. Parameterized algorithmics
adds to this by providing a way to use data reduction rules not only
heuristically, but also with guaranteed performance quality. These
so-called kernelizations guarantee an upper bound on the size of the
reduced instance, which solely depends on the parameter value.
More precisely, the concept is defined as follows:

Definition 2 ([53, 109]). Let I be an instance of a parameterized
problem with given parameter k. A reduction to a problem kernel (or
kernelization) is a polynomial-time algorithm that replaces I by a new
instance I' and k by a new parameter k' such that

• the size of I' and the value of k' are guaranteed to only depend on
some function of k, and

• the new instance I' has a solution with respect to the new
parameter k' if and only if I has a solution with respect to the original
parameter k.

Kernelizations can help to understand the practical
effectiveness of some data reduction rules and, conversely, the quest for
kernelizations can lead to new and powerful data reduction rules
based on deep structural insights.

Intriguingly, there is a close connection between
fixed-parameter tractable problems and those problems for which there
exists a kernelization; in fact, they are exactly the same [36].
Unfortunately, the running time of a fixed-parameter algorithm
directly obtained from a kernelization is usually not practical and,
in the other direction, there exists no constructive scheme for
developing data reduction rules for a fixed-parameter tractable
problem. Nevertheless, this equivalence can establish the
fixed-parameter tractability and amenability to kernelization of a problem
by knowing just one of these two properties.

In this section, we first illustrate the concept of kernelization by a
simple example concerning the Vertex Cover problem. We then
show a more involved kernelization algorithm for the graph
clustering problem Cluster Editing. Finally, we discuss the limits of the
kernelization approach for fixed-parameter tractable problems and
present an extension of the kernelization concept that can be used
to cope with the nonexistence of problem kernels.

2.2.1 A Simple Kernelization for Vertex Cover

Consider our running example Vertex Cover. In order to cover an
edge in the graph, one of its two endpoints must be in the vertex
cover. If one of these is a degree-1 vertex (that is, it has exactly one
neighbor), then the other endpoint has the potential to cover more
edges than this degree-1 vertex, leading to a first data reduction
rule.

Reduction Rule VC1

If there is a degree-1 vertex, then put its neighboring vertex into the cover.

Here, “put into the cover” means adding the vertex to the
solution set and removing it and its incident edges from the
instance. Note that this reduction rule assumes that we are only
looking for one optimal solution to the Vertex Cover instance we
are trying to solve; there may exist other minimum vertex covers
that do include the reduced degree-1 vertex (see Note 2).

After having applied Rule VC1, we can further do the following
in the fixed-parameter setting where we ask for a vertex cover of size
at most k.

Reduction Rule VC2

If there is a vertex v of degree at least k + 1, then put v into the cover.

The reason this rule is correct is that if we did not take v into the
cover, then we would have to take every single one of its k + 1
neighbors into the cover in order to cover all edges incident with v.
This is not possible because the maximum allowed size of the cover
is k.

After exhaustively performing Rules VC1 and VC2, all vertices
in the remaining graph have degree at most k. Thus, at most k edges
can be covered by choosing an additional vertex into the cover.
Since the solution set may be no larger than k, the remaining graph
can have at most $k^2$ edges if it has a solution. Clearly, we can assume
without loss of generality that there are no isolated vertices (that is,
vertices with no incident edges) in a given instance. In conjunction
with Rule VC1, this means that every vertex has degree at least two.
Hence, the remaining graph can contain at most $k^2$ vertices.

Stepping back, what we have just done is the following: After
applying two polynomial-time data reduction rules to an instance of
Vertex Cover, we arrived at a reduced instance whose size can be
expressed solely in terms of the parameter k. Hence, considering
Definition 2, we have found a kernelization for Vertex Cover.

2.2.2 A Kernelization for Cluster Editing

In the above example kernelization for Vertex Cover, there is a
notable difference between Rules VC1 and VC2: Rule VC1 is
based on a local optimality argument whereas Rule VC2 makes
explicit use of the parameter k. In applications, the first type of
data reduction rules is usually preferable, as they can be applied
independently of the value of k and this value is only used in the
analysis of the power of the data reduction rules. For the NP-hard
graph clustering problem Cluster Editing, we now present an
efficient kernelization algorithm that is based solely on a data
reduction rule of the first type.
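
Before turning to Cluster Editing, the two Vertex Cover rules can be written down as a minimal sketch. The representation of the graph as a dictionary of neighbor sets and the function name `kernelize_vertex_cover` are assumptions made for this illustration. As a further assumption, Rule VC2 is applied against the remaining budget (k minus the number of vertices already put into the cover) rather than the original k.

```python
def kernelize_vertex_cover(adj, k):
    """Apply Rules VC1 and VC2 exhaustively.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    Returns (reduced_adj, remaining_k, forced) or None if it is already
    clear that no vertex cover of size at most k exists.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    forced = set()

    def put_into_cover(v):
        # Add v to the solution and delete it with its incident edges.
        for u in adj.pop(v):
            adj[u].discard(v)
        forced.add(v)

    changed = True
    while changed:
        changed = False
        # Isolated vertices cover nothing and can be dropped.
        for v in [v for v, nbrs in adj.items() if not nbrs]:
            del adj[v]
        for v in list(adj):
            if v not in adj:
                continue
            degree = len(adj[v])
            if degree == 1:                  # Rule VC1: take the neighbor
                put_into_cover(next(iter(adj[v])))
            elif degree > k - len(forced):   # Rule VC2, remaining budget
                put_into_cover(v)
            else:
                continue
            changed = True
            if len(forced) > k:
                return None
    remaining_k = k - len(forced)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    if num_edges > remaining_k ** 2:
        return None  # more edges than remaining_k vertices can cover
    return adj, remaining_k, forced
```

Only the VC2 branch and the final size check look at k; the VC1 branch is the kind of parameter-independent rule that the paragraph above calls preferable.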

CLUSTER EDITING

Input: An undirected graph G = (V, E), an edge-weight function
$\omega : \binom{V}{2} \to \mathbb{N}^+$, and a nonnegative integer k.

Task: Find whether we can modify G to consist of vertex-disjoint cliques
(that is, fully connected components) by adding or deleting a set of edges
whose weights sum up to at most k.

Fig. 2 Illustration for Cluster Editing with unit weights: By removing two edges
from and adding one edge to the graph on the left (that is, k = 3), we can obtain
a graph that consists of two vertex-disjoint cliques

Cluster Editing can be used, for example, to cluster proteins
with high sequence similarity [23] and to identify cancer
subtypes [133]; a comprehensive overview of its applications is given
by Böcker and Baumbach [20]. Many theoretical studies consider
the case in which all edges have weight one, but the weighted
version of Cluster Editing is more relevant in biological
applications. The positive edge weights describe the cost to delete an
existing edge or to insert a missing edge, respectively. Figure 2
shows an instance of the unweighted problem variant together
with a solution. A simple kernelization for Cluster Editing uses
similar high-degree reduction rules as the Vertex Cover
kernelization described above. These rules yield a kernel with $O(k^2)$
vertices [67]. This bound can be improved to $O(k)$ vertices using
reduction rules whose correctness is based on local optimality
arguments. We now describe such a kernelization algorithm for
Cluster Editing that was developed by Cao and Chen [39].

The idea of this kernelization is to examine for each vertex v of
the graph whether its neighborhood is already very dense and only
loosely connected to the rest of the graph. If this is the case, then it
is optimal to put all neighbors of v in the same cluster as v. This
knowledge can be used to identify edges that have to be deleted or
edges that have to be added. Formally, the algorithm computes the
sum of the weights of the missing edges in the neighborhood N[v];
this number is denoted by $\delta(v)$. Then it computes the sum of the
edge weights between N[v] and $V \setminus N[v]$; this number is denoted
by $\gamma(v)$. These two measures are combined to form what is called
the stable cost of a vertex v, defined as $c(v) = 2\delta(v) + \gamma(v)$.
Now a vertex v is called reducible if $c(v) < |N[v]|$. The main
consequence of being reducible is that if v is reducible, then there is
an optimal solution such that N[v] is contained in a single cluster.
This implies the correctness of the following data reduction rules;
an example application of the first two reduction rules is given
in Fig. 3.
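
The stable cost and the reducibility test can be sketched directly from these definitions. The graph representation (a dictionary of neighbor sets), the `weight` callback taking a two-element frozenset, and the function names are hypothetical choices for this illustration; they are not taken from the implementation of Cao and Chen.

```python
from itertools import combinations


def stable_cost(v, adj, weight):
    """Return (delta, gamma, c) for vertex v as defined in the text.

    adj: dict vertex -> set of neighbors.
    weight: function taking a frozenset {a, b} and returning a positive
            integer (assumed interface; covers edges and non-edges alike).
    """
    closed = adj[v] | {v}                       # N[v]
    # delta(v): total weight of missing edges inside N[v]
    delta = sum(weight(frozenset((a, b)))
                for a, b in combinations(closed, 2)
                if b not in adj[a])
    # gamma(v): total weight of edges between N[v] and the rest
    gamma = sum(weight(frozenset((a, b)))
                for a in closed
                for b in adj[a] - closed)
    return delta, gamma, 2 * delta + gamma


def is_reducible(v, adj, weight):
    """A vertex is reducible if its stable cost is below |N[v]|."""
    _, _, cost = stable_cost(v, adj, weight)
    return cost < len(adj[v]) + 1
```

With unit weights, `weight` can simply be `lambda pair: 1`.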

The first rule adds missing edges in neighborhoods of reducible
vertices.

Reduction Rule CE1

If there is a reducible vertex v and a pair of vertices u, x in N[v] that are not
neighbors, then add {u, x} to G and decrease k by $\omega(\{u, x\})$.

Fig. 3 The application of Reduction Rules CE1 and CE2 to an instance of Cluster Editing. In the example, the
weight of all existing and missing edges is 1. Initially, the vertex v is reducible. Then, Rule CE1 inserts the
missing edge in N[v]. Subsequently, Rule CE2 deletes the edge between u and w, since it is more costly to
make w adjacent to all vertices of N[v] than to delete this edge

The next rule finds vertices that have some but only few
neighbors in N[v]. In an optimal solution, these vertices are never in the
same cluster as N[v]. Thus, the edges between these vertices
and N[v] may be deleted.

Reduction Rule CE2

If there is a reducible vertex v and a vertex $u \notin N[v]$ such that it is more costly
to add all missing edges between u and N[v] than to remove all edges
between u and N[v], then remove all edges between u and N[v] and
decrease k accordingly.

The final rule merges N[v] into one vertex and adjusts the edge
weights accordingly.

Reduction Rule CE3

If there is a reducible vertex v to which Rules CE1 and CE2 do not apply,
then merge N[v] into a single vertex v'. For each vertex $u \in V \setminus N[v]$,
set $\omega(\{u, v'\}) := \sum_{x \in N[v]} \omega(\{u, x\})$.

As long as the instance contains a reducible vertex, the
reduction rules will either modify an edge or merge a vertex. If there are
no more reducible vertices, then the last trivial step of the
kernelization algorithm is to remove all isolated vertices from the instance.
Afterwards, the instance has at most 2k vertices. The intuition
behind this size bound is the following. Every edge has weight at
least one, so a solution contains at most k edges. Since each edge
has two endpoints, the modifications can affect at most 2k vertices.
Now if in a cluster every vertex is affected, then the size of the
cluster is at least two times the number of edge modifications
within the cluster plus the number of edge modifications between
this cluster and other clusters. If there is a vertex v in the cluster that
is not affected by the solution, then the same bound on the cluster
size holds, in this case because v is not reducible. Summing these
size bounds over all clusters, we obtain a sum of edges in which each
solution edge appears at most twice. This gives the size bound of 2k
vertices.

2.3 Limits and Extensions of Kernelization

The two example problems Vertex Cover and Cluster Editing are
especially amenable to kernelization since they admit
polynomial-size problem kernels. That is, the size bound for the kernel is a
polynomial function in the parameter k. While all fixed-parameter
tractable problems admit a problem kernelization, it is not the case
that all fixed-parameter tractable problems admit polynomial
kernels [53, 94]. It is beyond the scope of this paper to introduce the
proof techniques for showing nonexistence of polynomial problem
kernels. We will give, however, an example of a biologically
motivated graph problem that does not admit a polynomial problem
kernel and describe one way of circumventing this hardness result.

Fig. 4 A graph with a 2-club of size six (marked black)

2-CLUB

Input: An undirected graph G = (V, E) and a nonnegative integer k.

Task: Find a set $S \subseteq V$ of at least k vertices such that the subgraph induced
by S has diameter at most two.

The NP-hard 2-Club problem attempts to identify large
cohesive subgroups of an input graph. The idea behind the formulation
is to relax the overly restrictive definition of the Clique problem,
which only accepts solutions that are complete graphs or,
equivalently, that have diameter one. The 2-Club problem finds
applications in the detection of protein interaction complexes [115]; an
instance of 2-Club with a maximum-cardinality solution is shown
in Fig. 4.

As we will see, 2-Club is fixed-parameter tractable with respect
to the parameter solution size k. It does not, however, admit a
polynomial problem kernel for this parameter [119] (see Note 3
for a brief discussion).

In spite of this hardness result, one can still perform a useful
parameterized data reduction for 2-Club. The idea is to reduce the
problem to many problem kernels instead of just one. This
approach is called Turing kernelization. In the case of 2-Club, the
Turing kernelization consists of two simple parts. First, one looks
for a trivial solution using the following observation: for every
vertex v in a graph, its closed neighborhood N[v] induces a subgraph
of diameter at most two.

Reduction Rule 2-C

If there is a vertex v with at least k - 1 neighbors, then return N[v].

After this rule has been applied, we have either obtained a
solution or the maximum degree of the graph is at most k - 2.
Now, the Turing kernelization uses only one further observation:
To find a largest 2-club it is sufficient to examine for each vertex v of
the input graph G the subgraph of G that contains only the vertices
which have distance at most two to v. We can now use the fact
that the maximum degree is bounded: every vertex v has at
most k - 2 neighbors and each of these has at most k - 3 further
neighbors. Thus, 2-Club can be solved by independently solving n
small instances with $O(k^2)$ vertices each. Formally, this means that
2-Club admits a Turing kernelization with $O(k^2)$ vertices.
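
The two parts of this Turing kernelization can be sketched as follows. The dictionary-of-neighbor-sets representation, the function name, and the tagged return value are assumptions made for this illustration; solving each small instance exactly is left out.

```python
def two_club_turing_kernels(adj, k):
    """Sketch of the Turing kernelization for 2-Club described above.

    adj: dict vertex -> set of neighbors (undirected graph).
    Returns ("solution", S) if Rule 2-C finds a 2-club with at least k
    vertices, otherwise ("kernels", list_of_vertex_sets), one set per vertex.
    """
    # Rule 2-C: a closed neighborhood always has diameter at most two.
    for v, nbrs in adj.items():
        if len(nbrs) >= k - 1:
            return "solution", nbrs | {v}

    # Now every degree is at most k - 2, so each distance-2 ball is small.
    kernels = []
    for v, nbrs in adj.items():
        ball = {v} | nbrs
        for u in nbrs:
            ball |= adj[u]          # vertices at distance two from v
        kernels.append(ball)
    return "kernels", kernels
```

Each returned vertex set has at most 1 + (k - 2) + (k - 2)(k - 3) vertices, and the subgraph it induces would then be handed to an exact algorithm.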

2.4 Applications and Implementations

Solving Vertex Cover is relevant in many bioinformatics-related
scenarios such as analysis of gene expression data [44] and the
computation of multiple sequence alignments [40]. Besides solving
instances of Vertex Cover, another application of Vertex Cover
kernelizations is to search maximum-cardinality cliques (that is,
maximum-size complete subgraphs) in a graph. Here, use is made
of the fact that an n-vertex graph G has a clique of size n - k if and
only if its complement graph, that is, the graph that contains exactly
the edges not contained in G, has a size-k vertex cover. The best
known kernel for Vertex Cover (up to minor improvements) has 2k
vertices [108]. Abu-Khzam et al. [1] studied various kernelization
methods for Vertex Cover and their practical performance on
biological networks with respect to running time and resulting
kernel size. Experimental results for the computation of large
cliques via Vertex Cover are given, for example, by Abu-Khzam
et al. [2].
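
The correspondence between cliques and vertex covers of the complement graph can be illustrated with a short sketch. The function names and the edge-list representation are hypothetical; computing the vertex cover itself is assumed to be done by some other routine.

```python
from itertools import combinations


def complement_edges(vertices, edges):
    """Edges of the complement graph: all vertex pairs that are not edges."""
    present = {frozenset(e) for e in edges}
    return [pair for pair in combinations(sorted(vertices), 2)
            if frozenset(pair) not in present]


def clique_from_cover(vertices, cover_of_complement):
    """Vertices outside a vertex cover of the complement form a clique."""
    return set(vertices) - set(cover_of_complement)
```

For a triangle on vertices 1, 2, 3 with a pendant vertex 4 attached to 3, the complement has the edges {1, 4} and {2, 4}; its size-1 vertex cover {4} corresponds to the clique {1, 2, 3} of size n - k = 3.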

Several kernelization approaches including the one presented
in Subheading 2.2.2 have been implemented for Cluster
Editing [25, 76]. The Turing kernelization for 2-Club was implemented
and experimentally evaluated; it turned out to be a crucial
ingredient for obtaining an efficient algorithm for this problem [77].
Another biologically relevant clustering problem where
kernelizations have been successfully implemented is the Clique Cover
problem. Here, the task is to cover all edges of a graph using at most k
cliques (these may overlap). Using data reduction, Gramm
et al. [69] solved even large instances with 1 000 vertices and
k ≈ 6 000 as long as they are sparse (m ≈ 7 000).


3 Depth-Bounded Search Trees 

Once data reductions as discussed in the previous section have been 
applied to a problem instance, we are left with the “really hard” 
problem kernel to be solved. A standard way to explore the huge 
search space of a computationally hard problem is to perform a 
systematic exhaustive search. This can be organized in a tree-like 
fashion, which is the subject of this section. 

3.1 Basic Concepts

Search tree algorithms (also known as backtracking algorithms,
branching algorithms, or splitting algorithms) certainly are no
new idea and have extensively been used in the design of exact
algorithms (e.g., see [45, 60, 122]). The main contribution of
parameterized algorithmics to search tree algorithms is the
consideration of search trees whose depth is constrained by a
function in the parameter. Combined with insights on how to find
useful (and possibly non-obvious) parameters, this can lead to
search trees that are much smaller than those of naive brute-force
searches. For example, a very naive search tree approach for solving
Vertex Cover is to just take one vertex and branch into two cases:
either this vertex is in the vertex cover or not. For an n-vertex
graph, this leads to a search tree of size $O(2^n)$. As we outline in
this section, we can do much better than that and obtain a search
tree whose depth is upper-bounded by k, giving a size bound
of $O(2^k)$. Extending what we discuss here, even better search trees
of size $O(1.28^k)$ are possible [43]. Since usually $k \ll n$, this can
draw the problem into the zone of feasibility even for large graphs.

Besides depth-bounding, parameterized algorithmics provides
additional means to provably improve the speed of search tree
exploration, particularly by interleaving this exploration with
kernelizations, that is, applying data reduction to partially solved
instances during the exploration.

3.2 Case Studies

Starting with our running example Vertex Cover, this section
introduces the concept of depth-bounded search trees by three
case studies.

3.2.1 Vertex Cover Revisited

For many search tree algorithms, the basic idea is to find a small
subset of the input instance in polynomial time such that at least
one element of this subset must be part of an optimal solution to
the problem. In the case of Vertex Cover, the simplest such
subset is any set of two adjacent vertices. By definition of the
problem, one of these two vertices has to be part of a solution, or
the respective edge would not be covered. Thus, a simple search
tree algorithm to solve Vertex Cover on a graph G = (V, E) can
proceed by picking an arbitrary edge e = {v, w} and recursively
searching for a vertex cover of size k - 1 both in $G[V \setminus \{v\}]$ and
$G[V \setminus \{w\}]$, that is, in the graphs obtained by removing either v and
its incident edges or w and its incident edges. In this way, the
algorithm branches into two subcases knowing one of them must
lead to a solution of size at most k (provided that it exists).

As shown in Fig. 5, the recursive calls of the simple Vertex
Cover algorithm can be visualized as a tree structure. Because the
depth of the recursion is upper-bounded by the parameter value
and we always branch into two subcases, the number of cases that
are considered by this tree (its size, so to say) is $O(2^k)$.
Independent of the size of the input instance, it only depends on the value
of the parameter k.
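
The simple branching algorithm can be sketched in a few lines. The edge-list representation and the function name are assumptions for this illustration; removing a vertex together with its incident edges is realized by filtering the edge list.

```python
def vertex_cover_search_tree(edges, k):
    """Return a vertex cover of size at most k as a set, or None.

    edges: list of 2-tuples (u, v) describing an undirected graph.
    The recursion depth is at most k and each call branches twice,
    so at most 2^k leaves are explored.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is used up
    u, v = edges[0]           # pick an arbitrary uncovered edge
    for chosen in (u, v):     # one endpoint must be in the cover
        remaining = [e for e in edges if chosen not in e]
        sub = vertex_cover_search_tree(remaining, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None
```

Each recursive call does work linear in the number of edges, so the total running time of this sketch is $O(2^k \cdot m)$.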

The currently “best” search trees for Vertex Cover have
worst-case size $O(1.28^k)$ [43] and are mainly achieved by elaborate case
distinctions. These algorithms consist of several branching rules;
for example, the degrees of the endpoints of an edge determine
which of the branching rules is applied. However, for practical
applications it is always concrete implementation and testing that
has to decide whether the administrative overhead caused by
distinguishing more and more cases pays off. A simpler algorithm with
slightly worse search tree size bounds may be preferable.

Fig. 5 Simple search tree for finding a vertex cover of size at most k in a given
graph. The size of the tree is $O(2^k)$

3.2.2 A Search Tree Algorithm for Cluster Editing

For Vertex Cover, we have found a depth-bounded search tree by
observing that at least one endpoint of any given edge must be part
of the cover. A somewhat similar approach can be used to derive a
depth-bounded search tree for Cluster Editing (Fig. 6).

Fig. 6 The merge branching for Cluster Editing. In the example instance, all edges
and missing edges have unit weight, except {v, w}, which has weight 2. In one
branch, the edge {u, v} is deleted and k is reduced by 1. In the other branch, u
and v are merged and the edge weights are adjusted. For example, edge {x, w}
obtains weight 1 since the missing edge {u, w} had weight 1. Accordingly, k is
decreased by 1. All missing edges between x and other vertices have weight 2

Recall that the aim for Cluster Editing is to modify a graph into
a cluster graph, that is, a vertex-disjoint union of cliques, by
modifying edges whose weight sums up to at most k. Similar to
Vertex Cover, a search tree for Cluster Editing can be obtained by
noting that the desired graph of vertex-disjoint cliques forbids a
certain structure: If two vertices in a cluster graph are adjacent, then
their closed neighborhoods must be the same. Hence, whenever we
encounter two vertices u and v in the input graph G that are
adjacent and where one vertex, say v, has a neighbor w that is not
adjacent to u, we are compelled to do one of three things: Either
remove the edge {u, v}, or add the edge {u, w}, or remove the
edge {v, w}. Note that each such modification incurs a cost of at
least one. Therefore, exhaustively branching into three cases, each
time decreasing k by at least one, we obtain a search tree of size
$O(3^k)$ to solve Cluster Editing. Using computer-aided algorithm design,
this idea can be improved to obtain, for the unit-weight case, a
search tree of size $O(1.92^k)$ [66]. The current-best theoretical
running time is, however, achieved by exploiting the fact that
edge weights make it possible to consider the merging operation
in a search tree algorithm. The observation is that in the presence of
a conflict as described above, one may either delete the edge {u, v}
or, otherwise, u and v are in the same cluster of the final cluster
graph. Thus, one may merge u and v and adjust the edge weights
accordingly. The main trick is that when performing the merging,
this still causes some cost: The edge {v, w} must be deleted or the
edge {u, w} must be added.

After merging u and v into a new vertex x one may thus
“remember” that the new edge {x, w} will incur a cost irrespective
of whether this edge is deleted or kept by a solution. With a more
refined branching strategy, this idea leads to a search tree of size
$O(1.82^k)$ for the general case [23] and of size $O(1.62^k)$ for the
unit-weight case [19].

3.2.3 The Closest String Problem

The Closest String problem is also known as Consensus String.

CLOSEST STRING

Input: A set of k length-$\ell$ strings $s_1, \ldots, s_k$ and a nonnegative integer d.

Task: Find a consensus string s that satisfies $d_H(s, s_i) \le d$ for all $i = 1, \ldots, k$.

Here, $d_H(s, s_i)$ denotes the Hamming distance between two
strings s and $s_i$, that is, the number of positions where s and $s_i$ differ.
Note that there are at least two immediately obvious
parameterizations of this problem. The first is given by choosing the “distance
parameter” d and the second is given by the number of input
strings k. Both parameters are reasonably small in various
applications; we refer to Gramm et al. [65] for more details. Here, we
focus on the parameter d.

Closest string appears, for example, in primer design, where we
try to find a short DNA sequence, called a primer, that binds to a set of
(longer) target DNA sequences as a starting point for replication of
these sequences. How well the primer binds to a sequence is mostly
determined by the number of positions in that sequence that
hybridize to it. While often done by hand, Stojanovic et al. [124]
proposed a computational approach for finding a well-binding
primer of length ℓ. First, the target sequences are aligned, that is,
as many matching positions within the sequences as possible are
grouped into columns. Then, a “sliding window” of length ℓ is
moved over this alignment, giving a closest string problem for
each window position. Figure 7 illustrates this (see [63] for details).

    ...GGTGAG  ATCTATAGAAGT  TGAATGC...
    ...GGTGGA  ATCTACAGTAAC  GGATTGT...
    ...GGCGAG  ATCTACAGAAGT  GGAATGC...
    ...GGCGAG  ATCTATAGAGAT  GGAATGC...
    ...GGCAAG  ATCTATAGAAGT  GGAATGC...

    closest string:    ATCTACAGAAAT
    primer candidate:  TAGATGTCTTTA

Fig. 7 Illustration to show how DNA primer design can be achieved by solving
closest string instances on length-ℓ windows of aligned DNA sequences. The
primer candidate is not the computed consensus string but its nucleotide-wise
complement
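To make the window construction concrete, the following Python sketch cuts an alignment into windows and checks the numbers shown in Fig. 7. It is only an illustration: the helper names (hamming, windows, primer_candidate) are ours, and we assume a gap-free alignment given as equal-length strings over the alphabet A, C, G, T.

```python
def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def windows(alignment, length):
    """Yield (start, rows) for every window of the given length."""
    width = len(alignment[0])
    for start in range(width - length + 1):
        yield start, [row[start:start + length] for row in alignment]


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def primer_candidate(consensus):
    """Nucleotide-wise complement of the consensus string, as in Fig. 7."""
    return consensus.translate(COMPLEMENT)


# The window highlighted in Fig. 7 and its closest string.
window = ["ATCTATAGAAGT", "ATCTACAGTAAC", "ATCTACAGAAGT",
          "ATCTATAGAGAT", "ATCTATAGAAGT"]
consensus = "ATCTACAGAAAT"
print(max(hamming(consensus, s) for s in window))  # 2
print(primer_candidate(consensus))                 # TAGATGTCTTTA
```

For this window, the consensus string is within Hamming distance 2 of every row, so the corresponding closest string instance has a solution for d = 2.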

In the remainder of this case study, we sketch a fixed-parameter
search tree algorithm for closest string due to Gramm et al. [65],
the parameter being the distance d. Unlike for vertex cover and
cluster editing, the central challenge lies in even finding a
depth-bounded search tree, which is not obvious at first glance. Once
found, however, the derivation of the upper bound for the search
tree size is straightforward. The underlying algorithm is very simple
to implement.

The main idea behind the algorithm is to maintain a candidate
string s* for the center string and compare it to the strings s_1, ..., s_k.
If s* differs from some s_i in more than d positions, then we know
that s* needs to be modified in at least one of these positions to
match the character that s_i has there. Consider the following
observation:

Observation 1. Let d be a nonnegative integer. If two strings s_i
and s_j have a Hamming distance greater than 2d, then there is no
string that has a Hamming distance of at most d to both of s_i and s_j.

This means that s_i is allowed to differ from s* in at most 2d
positions. Hence, among any d + 1 of those positions where s_i
differs from s*, at least one must be modified to match s_i. This
can be used to obtain a search tree that solves closest string.
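A minimal Python sketch of such a search tree is given below. It starts with the first input string as candidate, looks for an input string that is more than d positions away, and branches on d + 1 of the mismatch positions. The function name closest_string and the pruning test against the remaining budget are our choices for this illustration; the data reduction used in [65] is not included.

```python
def closest_string(strings, d):
    """Return a string within Hamming distance d of all inputs, or None.

    All input strings are assumed to have the same length.
    """
    def search(candidate, budget):
        for s in strings:
            diff = [i for i in range(len(s)) if candidate[i] != s[i]]
            if len(diff) > d + budget:
                # Even `budget` further changes cannot bring s within distance d.
                return None
            if len(diff) > d:
                # At least one of any d + 1 mismatch positions must match s.
                for i in diff[:d + 1]:
                    child = candidate[:i] + s[i] + candidate[i + 1:]
                    found = search(child, budget - 1)
                    if found is not None:
                        return found
                return None
        return candidate  # no input string is too far away

    # The candidate may move away from the first string in at most d positions.
    return search(strings[0], d)


window = ["ATCTATAGAAGT", "ATCTACAGTAAC", "ATCTACAGAAGT",
          "ATCTATAGAGAT", "ATCTATAGAAGT"]
print(closest_string(window, 2))
print(closest_string(window, 1))  # None: two rows differ in 4 > 2d positions
```

Every recursive call decreases the budget by one and branches into at most d + 1 cases, so at most (d + 1)^d leaves are explored.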
We start with a string from {s_1, ..., s_k} as the candidate string s*,
knowing that a center string can differ from it in at most d
positions. If s* already is a valid center string, we are done. Otherwise,
there exists a string s_i that differs from s* in more than d positions,
but in at most 2d. Choosing any d + 1 of these positions, we branch
into (d + 1) subcases, each subcase modifying a position in s* to
match s_i. This position cannot be changed anymore further down in
the search tree (otherwise, it would not have made sense to make it
match s_i at that position). Hence, the depth of the search tree is
upper-bounded by d, for if we were to go deeper down into the
tree, then s* would differ in more than d positions from the original
string we started with. Thus, closest string can be solved by
exploring a search tree of size O((d + 1)^d) [65]. Combining data
reduction with this search tree, we arrive at the following:

Theorem 1. Closest string can be solved in O(k · ℓ + k · d · (d + 1)^d)
time.

It might seem as if this result is purely of theoretical interest;
after all, the term (d + 1)^d becomes prohibitively large already for
d = 15. Two things, however, should be noted in this respect:
First, for one of the main applications of closest string, primer
design, d is very small (often less than 4). Second, empirical analysis
reveals that when the algorithm is applied to real-world and random
instances, it often beats the proven upper bound by far, solving
many real-world instances in less than a second. The algorithm is
also faster than a simple integer linear programming formulation of
closest string when the input consists of many strings and ℓ is
small [65].

Unfortunately, many variants of closest string are known to be
intractable for many standard parameters [56, 68, 103]. Roughly
speaking, these variants deal with finding a matching substring and
distinguish between strings to which the center is supposed to be
close and strings to which it should be distant.

3.3 Applications and Implementations

In combination with data reduction, the use of depth-bounded
search trees has proven itself quite useful in practice, for example,
making it possible to find vertex covers of more than ten thousand
vertices in some dense graphs of biological origin [2]. It should also
be noted that search trees trivially allow for a parallel implementation:
when branching into subcases, each process in a parallel setting can
further explore one of these branches with no additional
communication required. Experimental results for vertex cover show
linear speedups even for thousands of cores [3].

The merge-based search tree algorithm for cluster editing can 
solve many instances arising in the analysis of protein similarity 
data [23]; it is part of a software package [132]. A fixed-parameter 
search tree algorithm was also used to solve instances of the
minimum common string partition problem [34]. This NP-hard
problem is motivated by applications in comparative genomics; the 
fixed-parameter algorithm was able to solve the problem on some 
bacterial genomes. The parameters exploited by the algorithm are 
the number of breakpoints and the maximum gene copy number in 
the genomes. Fixed-parameter search tree algorithms have also 
been applied for solving the maximum agreement forest problem 
which arises in the comparison of phylogenetic trees [130]; the 
fixed-parameter algorithm outperformed two previous approaches 
for maximum agreement forest, one using a formulation as integer 
linear program and another one using a formulation as satisfiability 
problem. Another example is the search for k-plexes in graphs, 
which can be used, for example, to model functional modules in 
protein interaction networks. By combining search trees with data 
reduction, it is often possible to outperform previously used 
methods [107]. 

Besides parameterized algorithmics, search tree algorithms
are studied extensively in the area of artificial intelligence and
heuristic state space search. There, the key to speedups lies in
admissible heuristic evaluation functions, which quickly give a lower
bound on the distance to the goal. The reason that admissible heuristics
are rarely considered by the parameterized algorithmics community
in their works (see [64] for a counterexample) is that they typically
cannot improve the asymptotic running time. Still, the speedups
obtained in practice can be quite pronounced, as demonstrated for
vertex cover [57].

As with kernelizations, algorithmic developments outside the 
fixed-parameter setting can make use of the insights that have been 
gained in the development of depth-bounded search trees in a 
fixed-parameter setting. One example for this is the minimum
quartet inconsistency problem arising in the construction of
evolutionary trees. Here, an algorithm that uses depth-bounded search trees
was developed by Gramm and Niedermeier [64]. Their insight was 
used by Wu et al. [134] to develop a faster (non-parameterized) 
algorithm for this problem. 

In conclusion, depth-bounded search trees with clever branching
rules are certainly one of the first approaches to try when
solving fixed-parameter tractable problems in practice. 


4 Dynamic Programming 

Dynamic programming is one of the most useful algorithm design 
techniques in bioinformatics; it also plays an important role in 
developing fixed-parameter algorithms. Since dynamic programming
is a classic algorithm design technique covered in many
standard textbooks [122], we keep the presentation of
“fixed-parameter dynamic programming” short.

4.1 Basic Concepts

The general idea is to recursively break down the problem into
possibly overlapping subproblems whose optimal solutions make it
possible to find an overall optimal solution. The solutions to
subproblems are stored in a table, avoiding recalculation. A classic
example is sequence alignment of two strings, for instance, using the
Needleman-Wunsch algorithm [18]. The dynamic programming
technique, however, is not restricted to polynomial-time solvable
problems.

The running time of dynamic programming depends mainly on 
the table size, so the main trick in obtaining fixed-parameter 
dynamic programming algorithms is to bound the size of the 
table by a function of the parameter times a polynomial in the 
input size. Two generic methods for this are tree decompositions 
and color-coding, described in Subheadings 5 and 6, respectively. 
In many cases, however, the table size is obviously bounded in the 
parameter and thus no additional techniques are necessary to 
obtain a fixed-parameter algorithm. 

4.2 Case Study 

One application of dynamic programming is in the interpretation of 
mass spectrometry data, which contains mass peaks for a sample 
molecule and for fragments thereof [22, 27]. The method builds a 
graph where a vertex corresponds to a possible molecular formula 
of a peak, and an edge corresponds to a hypothetical fragmentation 
step. Edges are weighted by the likeliness of the corresponding 
fragmentation step. The goal is then to calculate a maximum
scoring subtree of this graph. In this tree, we must use only one of the
molecular formulas of a peak. This is achieved by giving each vertex 
a corresponding color and asking for a colorful subtree. 

MAXIMUM COLORFUL SUBTREE 

Input: A directed graph D = (V, A) with a vertex coloring c: V → C and arc
weights w: A → Q+.

Task: Find a subtree of D that uses each color at most once and has
maximum total arc weight.

This NP-hard problem can be solved by dynamic programming
[22, 27, 120] by building a table W(v, S) for v ∈ V and
S ⊆ C. An entry W(v, S) holds the maximum score of a subtree
with root v whose vertex set has exactly the colors of S. The table is
filled out with the following recurrence:

\[
W(v, S) = \max
\begin{cases}
\max_{u \in V :\, c(u) \in S \setminus \{c(v)\}} W(u, S \setminus \{c(v)\}) + w(v, u) \\
\max_{(S_1, S_2) :\, S_1 \cap S_2 = \{c(v)\},\ S_1 \cup S_2 = S} W(v, S_1) + W(v, S_2)
\end{cases}
\tag{1}
\]

with initial condition W(v, {c(v)}) = 0 and the weight of nonexistent
arcs set to −∞. The first line extends a tree by introducing v as
new root and adding the arc (v, u), and the second line merges two
trees that have the same root but are otherwise disjoint.

The table W has n · 2^k entries, where n is the number of vertices
and k is the number of different vertex colors, and filling it out can
be done in O(3^k k m) time, where m is the number of arcs. Thus,
maximum colorful subtree is fixed-parameter tractable
with respect to the parameter k. In the application, the parameter is
the number of peaks in the spectrum which is usually small.
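The recurrence (1) translates almost directly into code. The following Python sketch is a minimal illustration, not the implementation evaluated in [22, 27]: color sets are encoded as bitmasks, vertices are numbered 0, ..., n − 1, and the function name max_colorful_subtree as well as the dictionary-based arc representation are assumptions made for this example.

```python
import math


def max_colorful_subtree(n, color, arcs, num_colors):
    """Maximum total arc weight of a colorful subtree.

    color[v] is an integer in range(num_colors); arcs maps (v, u) to the
    weight of the arc from v to u.
    """
    neg_inf = -math.inf
    full = 1 << num_colors
    out = [[] for _ in range(n)]
    for (v, u), weight in arcs.items():
        out[v].append((u, weight))
    # W[v][S]: best score of a subtree rooted at v using exactly the colors S.
    W = [[neg_inf] * full for _ in range(n)]
    for S in range(1, full):          # subsets of S are numerically smaller
        for v in range(n):
            cv = 1 << color[v]
            if not S & cv:
                continue
            if S == cv:
                W[v][S] = 0.0         # initial condition
                continue
            rest = S ^ cv
            best = neg_inf
            # First line of (1): v becomes the new root above a child u.
            for u, weight in out[v]:
                if rest & (1 << color[u]):
                    best = max(best, W[u][rest] + weight)
            # Second line of (1): merge two trees rooted at v that share
            # only the color of v.
            sub = (rest - 1) & rest
            while sub:
                best = max(best, W[v][sub | cv] + W[v][(rest ^ sub) | cv])
                sub = (sub - 1) & rest
            W[v][S] = best
    return max(max(row) for row in W)


# Vertices 1 and 2 share a color, so only one of them can be used.
colors = [0, 1, 1, 2]
arcs = {(0, 1): 5, (0, 2): 3, (1, 3): 1, (2, 3): 4, (0, 3): 1}
print(max_colorful_subtree(4, colors, arcs, 3))
```

The inner loop over all splits of each color set is what leads to the factor 3^k in the running time.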

4.3 Applications and Implementations

The algorithm described in Subheading 4.2 was found to be fast
and accurate in determining glycan structure [27]. There are several
further applications of dynamic programming over exponentially
sized tables. In phylogenetics, for example, the task of reconciling a
binary gene tree with a nonbinary species tree can be solved via a
dynamic programming algorithm whose table size is exponential
only in the maximum outdegree of the species tree [125]. The
implementation solves instances based on cyanobacterial gene
trees on average in less than 1 s. In these instances, the parameter
value ranges from 2 to 6.

Another application of dynamic programming is in a variant of
haplotyping (see also Subheading 7.2) which deals with the analysis
of genomic fragments. Using dynamic programming, solutions to
the weighted minimum error correction formulation can be found in
a running time of O(2^k · m). Here, k is the maximum coverage of
any genome position by the input fragments and m is the number
of SNPs per sequencing read [116]. The algorithm scales up to
k ≈ 20.


5 Tree Decompositions of Graphs 

Many NP-hard graph problems become computationally feasible
when they are restricted to cycle-free graphs, that is, trees or
collections of trees (forests). Trees, while potentially simplifying
computation, form a very limited class of graphs that seldom suffices as a
model for real-life applications. Hence, as a compromise between
general graphs and trees, one might want to look at “tree-like”
graphs. This tree-likeness can be formalized by the concept of tree
decompositions. In this section, we survey some important aspects of
tree decompositions and their algorithmic use with respect to
computational biology and FPT. Surveys on this topic are given
by Berger et al. [11] and Bodlaender and Koster [29].

5.1 Basic Concepts

There is a very helpful and intuitive characterization of tree
decompositions in terms of a robber-cop game in a graph [28]: A robber
stands on a graph vertex and, at any time, he can run at arbitrary
speed to any other vertex of the graph as long as there is a path
connecting both. The only restriction is that he is not permitted to
run through a cop. There can be several cops and, at any time, each
of them may either stand on a graph vertex or be in a helicopter
(that is, she is above the game board and can move anywhere
without being restricted by graph edges). The cops want to land a
helicopter on the vertex occupied by the robber. The robber can see
a helicopter approaching its landing vertex and he may run to a new
vertex before the helicopter actually lands. Thus, the cops want to
occupy all vertices adjacent to the robber’s vertex, making him
unable to move, and to then land one more remaining helicopter
on the robber’s vertex itself to catch him. The treewidth of the
graph is the minimum number of cops needed to catch a robber
minus one (observe that if the graph is a tree, two cops suffice and
trees hence have a treewidth of one) and a corresponding tree
decomposition is a tree structure that provides the cops with a
scheme to catch the robber. Intuitively, the tree decomposition
indicates “bottlenecks” (separators) in the graph and thus reveals
an underlying scaffold that can be exploited algorithmically.

Formally, tree decompositions and treewidth center around the
following somewhat technical definition; Fig. 8 shows a graph
together with an optimal tree decomposition of width two.

Definition 3. Let G = (V, E) be an undirected graph. A tree
decomposition of G is a pair ({X_i | i ∈ I}, T) where each X_i is a subset
of V, called a bag, and T is a tree with the elements of I as nodes. The
following three properties must hold:

1. ⋃_{i ∈ I} X_i = V;

2. for every edge {u, v} ∈ E, there is an i ∈ I such that {u, v} ⊆ X_i;
and

3. for all i, j, k ∈ I, if j lies on the path between i and k in T, then
X_i ∩ X_k ⊆ X_j.

The width of ({X_i | i ∈ I}, T) equals max{|X_i| : i ∈ I} − 1. The
treewidth of G is the minimum k such that G has a tree decomposition
of width k.



Fig. 8 A graph together with a tree decomposition of width 2. Observe that, as demanded by the consistency
property, each graph vertex induces a subtree in the decomposition tree

The third condition of the definition is often called consistency
property. It is important in dynamic programming, the main
algorithmic tool when solving problems on graphs of bounded
treewidth. An equivalent formulation of this property is to demand that
for any graph vertex v, all bags containing v form a connected
subtree.
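The three properties of Definition 3 are easy to check mechanically. The Python sketch below does so, using the connected-subtree formulation of the consistency property. The function names and the input format (a dictionary of bags plus a list of tree edges between bag indices) are assumptions made for this illustration, and the code does not verify that the given tree edges really form a tree.

```python
from collections import deque


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three properties of a tree decomposition.

    bags maps a node of T to a set of graph vertices; tree_edges lists the
    edges of T (assumed, not checked, to form a tree on the bag nodes).
    """
    # Property 1: every graph vertex occurs in some bag.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Property 2: both endpoints of every edge occur together in some bag.
    for u, v in edges:
        if not any({u, v} <= bag for bag in bags.values()):
            return False
    # Property 3: the bags containing a vertex induce a connected subtree.
    adj = {i: set() for i in bags}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen = {start}
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in adj[i] & nodes:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A path a-b-c: the bags are the edges, so the width is 1.
bags = {0: {"a", "b"}, 1: {"b", "c"}}
print(is_tree_decomposition("abc", [("a", "b"), ("b", "c")], bags, [(0, 1)]))
print(width(bags))
```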

For trees, the bags of a corresponding tree decomposition are
simply the two-element vertex sets formed by the edges of the tree.
In the definition, the subtraction of 1 thus ensures that trees have a
treewidth of 1. In contrast, a clique of n vertices has treewidth n − 1.
The corresponding tree decomposition trivially consists of one bag
containing all graph vertices; in fact, no tree decomposition with
smaller width is attainable since it is known that every complete
subgraph of a graph G is completely “contained” in a bag of G’s tree
decomposition.

Tree decompositions of graphs are connected to another central
concept in algorithmic graph theory: graph separators are
vertex sets whose removal from the graph separates the graph into
two or more connected components. Each bag of a tree
decomposition forms a separator of the corresponding graph.

Given a graph, determining its treewidth is an NP-hard problem
itself. However, several tools and heuristics exist that construct
tree decompositions [29-31], and for some graphs that appear in
practice, computing a tree decomposition is easy. Here, we
concentrate on the algorithmic use of tree decompositions, assuming that
they are provided to us.

5.2 Case Study

Typically, tree decomposition-based algorithms have two stages:

1. Find a tree decomposition of bounded width for the input
graph.

2. Solve the problem by dynamic programming on the tree
decomposition, starting from the leaves.

Intuitively speaking, a decomposition tree provides us with a
scaffold-structure that allows for efficient and consistent processing
through the graph. By design, this scaffold leads to optimal
solutions even when the utilized tree decompositions are not optimal;
however, the algorithm will run slower and consume more memory
in that case.

To exemplify dynamic programming on tree decompositions, 
we make use of our running example vertex cover and sketch a 
fixed-parameter dynamic programming algorithm for vertex cover 
with respect to the parameter treewidth. 

Theorem 2. For a graph G with a given width-ω tree decomposition
({X_i | i ∈ I}, T), an optimal vertex cover can be computed in
O(2^ω · ω · |I|) time.

The basic idea of the algorithm is to examine for each bag X_i all
of the at most 2^|X_i| possibilities to obtain a vertex cover for the
subgraph G[X_i]. This information is stored in tables, one for each
i ∈ I. Adjacent tables are updated in a bottom-up process starting at the
leaves of the decomposition tree. Each bag of the tree decomposition
thus has a table associated with it. During this updating process
it is guaranteed that the “local” solutions for each subgraph
associated with a bag of the tree decomposition are combined into a
“globally optimal” solution for the overall graph G. (We omit
several technical details here; these can be found in [109,
Chapter 10].) The following points of Definition 3 guarantee the validity
of this approach:

1. The first condition in Definition 3, that is, V = ⋃_{i ∈ I} X_i, makes
sure that every graph vertex is taken into account during the
computation.

2. The second condition in Definition 3, that is,
∀e ∈ E ∃i ∈ I : e ⊆ X_i, makes sure that all edges can be treated
and thus will be covered.

3. The third condition in Definition 3 guarantees the consistency
of the dynamic programming, since information concerning a
particular vertex v is only propagated between neighboring bags
that both contain v.
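The bottom-up table computation can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not the algorithm of [109] in full detail: the tree decomposition is passed as a dictionary of bags plus a list of tree edges, the function name vertex_cover_treewidth is hypothetical, and no effort is made to reach the running time of Theorem 2. For every bag, the table maps each local cover C of G[X_i] to the size of a smallest vertex cover of the part of the graph below that bag that agrees with C on X_i.

```python
from itertools import combinations


def vertex_cover_treewidth(edges, bags, tree_edges, root):
    """Size of a minimum vertex cover, given a valid tree decomposition."""
    adj = {i: [] for i in bags}
    for i, j in tree_edges:
        adj[i].append(j)
        adj[j].append(i)
    edge_set = {frozenset(e) for e in edges}

    def local_covers(bag):
        """All subsets of the bag that cover every edge inside the bag."""
        inner = [e for e in edge_set if e <= bag]
        for r in range(len(bag) + 1):
            for c in combinations(sorted(bag), r):
                c = frozenset(c)
                if all(e & c for e in inner):
                    yield c

    def solve(i, parent):
        table = {c: len(c) for c in local_covers(bags[i])}
        for j in adj[i]:
            if j == parent:
                continue
            child = solve(j, i)
            shared = bags[i] & bags[j]
            # Best child cost per choice on the shared vertices; shared
            # vertices are subtracted so that they are not counted twice.
            best = {}
            for c2, cost in child.items():
                key = c2 & shared
                extra = cost - len(key)
                if extra < best.get(key, float("inf")):
                    best[key] = extra
            table = {c: cost + best[c & shared]
                     for c, cost in table.items() if (c & shared) in best}
        return table

    return min(solve(root, None).values())


# Path 1-2-3-4 with the edges as bags: a minimum vertex cover has size 2.
path_edges = [(1, 2), (2, 3), (3, 4)]
path_bags = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}}
print(vertex_cover_treewidth(path_edges, path_bags, [(0, 1), (1, 2)], 0))
```

The consistency property is what makes the subtraction of the shared vertices correct: a vertex that occurs both in a child subtree and in the current bag must occur in the child bag as well.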

While the running time of the dynamic programming part can 
often be improved over a naive approach, there is evidence that 
known algorithms for some basic combinatorial problems are 
essentially optimal [101]. 

One thing to keep in mind for a practical application is that 
storing dynamic programming tables requires memory space that 
grows exponentially in the treewidth. Hence, even for “small” 
treewidths, say, between 10 and 20, the computer program may 
run out of memory and break down. Some techniques for limiting 
memory use have been proposed [12, 55, 70]. 

5.3 Applications and Implementations


Tree decomposition-based algorithms are a valuable alternative 
whenever the underlying graphs have small treewidth. As a rule of 
thumb, the typical border of practical feasibility lies somewhere 
below a treewidth of 20 for the underlying graph, although with 
advantageous data and careful implementation higher values are 
possible (e.g., [70]). Successful implementations for solving vertex
cover with tree decomposition approaches have been 
reported [4, 12]. 

A practical application of tree decompositions is found in
protein structure prediction, namely the prediction of backbone
structures and side-chain prediction. These two problems can be
modeled as a graph labeling problem, where the resulting graphs
have a very small treewidth in practice, allowing the problems to be
solved efficiently [11].

Besides taking an input graph, computing a tree decomposition
for it, and hoping that the resulting tree decomposition has small
treewidth, there have also been cases where a problem is modeled as
a graph problem such that it can be proven that the resulting graphs
have a tree decomposition with small treewidth that can efficiently
be found. As an example, Song et al. [123] used a so-called
conformational graph to specify the consensus sequence-structure of an
RNA family. They proved that the treewidth of this graph is
basically determined by the structural elements that appear in the RNA.
More precisely, they showed that if there is a bounded number of
crossing stems, say k, in a pseudoknot structure, then the resulting
graph has treewidth (2 + k). Since the number of crossing stems is
usually small, this yields a fast algorithm for searching RNA
secondary structures (see also [135]).

Other biological applications include peptide sequencing and
spectral alignment [100], molecule bond multiplicity
inference [26], charge group partitioning for biomolecular
simulations [38], and NMR interpretation [98]. The idea of exploiting
the treewidth of an auxiliary structure describing interdependencies
of the input also has attracted much attention in artificial
intelligence (AI) applications [62, 85].

Besides dynamic programming, a very powerful method to
obtain fixed-parameter results for the parameter treewidth is to
cast the problem as an expression in monadic second-order logic
(MSO) [97]. For example, for vertex cover, the expression is

\[ \mathrm{vc}(U) := \forall x, y \in V : \neg(\{x, y\} \in E) \lor x \in U \lor y \in U. \]

Since the worst-case running time obtained from this formulation
is extremely bad, this approach was thought to be impractical [109,
Chapter 10]. However, recently a solver was presented that indeed
just requires the user to provide the MSO expression [88, 96, 97].
If the problem at hand admits a formulation in MSO (as most
problems that are fixed-parameter tractable for treewidth do), this
provides a quick way to evaluate the feasibility of the treewidth
approach for the data at hand, with the option to get a quicker
algorithm by designing a customized dynamic programming algorithm.

Besides treewidth, a number of alternative concepts have been 
developed to compare the structure of a graph to a tree, including 
branch-width, rank-width, and hypertree-width [78, 97]. 


6 Color-Coding 


The color-coding technique due to Alon et al. [5] is a general
method for finding small patterns in graphs. In its simplest form,
color-coding can solve the minimum-weight path problem, which
asks for the cheapest path of length k in a graph. This has been
successfully employed with protein-protein interaction networks to
find signaling pathways [82, 120] and to evaluate pathway
similarity queries [121].

6.1 Basic Concepts

A naive approach to discover a small structure of k vertices within a
graph of n vertices would be to combinatorially try all of the
roughly n^k possibilities of selecting k out of n vertices and then
testing the selection for the desired structural property. This
approach quickly leads to a combinatorial explosion, making it
infeasible even for rather small input graphs of a few hundred
vertices. The central idea of color-coding is to randomly color
each vertex of a graph with one of k colors and to hope that all
vertices in the subgraph searched for obtain different colors (that is,
the vertex set becomes colorful).

When the structure that is searched for becomes colorful, the
task of finding it can be solved by dynamic programming in a
running time where the exponential part solely depends on k, the
size of the substructure searched for. Of course, given the
randomness of the initial coloring, most of the time the target structure will
actually not be colorful. Therefore, we have to repeat the process of
random coloring and searching (called a trial) many times until the
target structure is colorful at least once with sufficiently high
probability. As we will show, the number of trials also depends only on k
(albeit exponentially). Consequently this algorithm has a
fixed-parameter running time. Thus it is much faster than the naive
approach which needs O(n^k) time.

6.2 Case Study

Formally stated, the problem we consider is the following:

MINIMUM-WEIGHT PATH

Input: An undirected graph G with edge weights w: E → Q+ and a
nonnegative integer k.

Task: Find a simple length-k path in G that minimizes the sum over its edge
weights.

This problem is well known to be NP-hard [61, ND29]. What
makes the problem hard is the requirement of simple paths, that is,
paths where no vertex may occur more than once (otherwise, it is
easily solved by traversing a minimum-weight edge k − 1 times).

Given a fixed coloring of vertices, finding a minimum-weight
path that is colorful can be accomplished by dynamic
programming: Assume that for some i < k we have computed a value
W(v, S) for every vertex v ∈ V and every cardinality-i subset S of
vertex colors such that W(v, S) denotes the minimum weight of a
path that uses each color in S exactly once and ends in v. Clearly, the
resulting path is simple because no color is used more than once.
We can now use this to compute the values W(v, S) for all
cardinality-(i + 1) subsets S and vertices v ∈ V, because any colorful
length-(i + 1) path that ends in a vertex v ∈ V must be composed of
a colorful length-i path that does not use the color of v and ends in
a neighbor of v.

More precisely, we let

\[ W(v, S) = \min_{e = \{u, v\} \in E} \bigl( W(u, S \setminus \{\mathrm{color}(v)\}) + w(e) \bigr). \tag{2} \]

See Fig. 9 for an example.

Fig. 9 Example for solving minimum-weight path using the color-coding technique. Here, using (2) a new table
entry (right) is calculated using two already known entries (left and middle)
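For a fixed coloring, recurrence (2) can be implemented with color sets stored as bitmasks. The following Python sketch is a minimal illustration under our own assumptions: vertices are numbered 0, ..., n − 1, a path on k vertices (and thus k − 1 edges) is sought, and the function name min_weight_colorful_path is hypothetical. It handles a single coloring only; the random trials are not part of this block.

```python
import math


def min_weight_colorful_path(n, edges, coloring, k):
    """Minimum weight of a path that uses each of the k colors exactly once.

    edges maps an undirected edge (u, v) to its weight; coloring[v] is an
    integer in range(k). Returns math.inf if no colorful path exists.
    """
    adj = [[] for _ in range(n)]
    for (u, v), weight in edges.items():
        adj[u].append((v, weight))
        adj[v].append((u, weight))
    full = (1 << k) - 1
    # W[v][S]: minimum weight of a path ending in v that uses exactly
    # the colors in S.
    W = [[math.inf] * (1 << k) for _ in range(n)]
    for v in range(n):
        W[v][1 << coloring[v]] = 0.0
    for S in range(1, 1 << k):        # S without color(v) is smaller than S
        for v in range(n):
            cv = 1 << coloring[v]
            if not S & cv or S == cv:
                continue
            prev = S ^ cv
            W[v][S] = min((W[u][prev] + weight for u, weight in adj[v]),
                          default=math.inf)
    return min(W[v][full] for v in range(n))


edges = {(0, 1): 1, (1, 2): 5, (2, 3): 0.5, (0, 2): 2}
print(min_weight_colorful_path(4, edges, [0, 1, 2, 0], 3))  # 3.0
print(min_weight_colorful_path(4, edges, [0, 1, 2, 1], 3))  # 2.5
```

In the first coloring, the cheapest three-vertex path 0, 2, 3 is not colorful because vertices 0 and 3 share a color, so only a path of weight 3 is found. The second coloring makes this path colorful, which is why several random colorings have to be tried.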

It is straightforward to verify that on an m-edge graph the
dynamic programming takes O(2^k m) time. Whenever the
minimum-weight length-k path P in the input graph is colored
with k different colors (that is, its vertex set is colorful), then the
algorithm finds P. The problem, of course, is that the coloring of
the input graph is random and hence many coloring trials have to
be performed to ensure that the minimum-weight path is found
with a high probability. More precisely, the probability of any
length-k path (including the one with minimum weight) being
colorful in a single trial is

\[ P_c = \frac{k!}{k^k} > \sqrt{2 \pi k}\, e^{-k} \tag{3} \]

because there are k^k ways to arbitrarily color k vertices with k colors
and k! ways to color them such that no color is used more than
once. Using t trials, a path of length k is found with probability
1 − (1 − P_c)^t. Therefore, to ensure that a colorful path is found
with a probability greater than 1 − ε (for any 0 < ε < 1), at least

\[ t(\varepsilon) = \frac{\ln \varepsilon}{\ln(1 - P_c)} = -\ln \varepsilon \cdot O(e^k) \tag{4} \]

trials are needed. This bounds the overall running time by
2^{O(k)} · n^{O(1)}. While the result is only correct with a certain
probability, we can specify any desired error probability, say 0.1 %,
noting that even very low error probabilities do not incur excessive
extra running time costs.
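The quantities in (3) and (4) are easy to evaluate numerically. The short Python sketch below computes the success probability of a single trial and the resulting number of trials, rounded up to an integer; the function names are ours and serve only as an illustration.

```python
import math


def colorful_probability(k):
    """Probability P_c = k! / k^k that k fixed vertices become colorful."""
    return math.factorial(k) / k ** k


def trials_needed(k, eps):
    """Smallest integer t with (1 - P_c)^t <= eps."""
    p = colorful_probability(k)
    if p >= 1.0:
        return 1                      # k = 1: every coloring is colorful
    return math.ceil(math.log(eps) / math.log1p(-p))


for k in (3, 5, 8):
    print(k, colorful_probability(k), trials_needed(k, 0.001))
```

For k = 3 and an error probability of 0.1 %, this gives 28 trials. Since the number of trials depends on ε only through the factor ln ε, lowering the error probability from 0.1 % to 0.0001 % merely doubles the number of trials.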
6.3 Applications and Implementations

Note that the number of colors chosen poses a trade-off: While
using more than k colors increases the chance of a target structure
becoming colorful (and thus decreases the number of trials needed
to achieve a given error probability), it increases the running time
and memory requirements of the dynamic programming step. As a
theoretical analysis points out, using 1.3k colors instead of just k
improves the worst-case running time of the color-coding
algorithm. Moreover, in practice it is often beneficial to increase the
number of colors even further [82].
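The trade-off can be explored with a rough numerical model. In the Python sketch below, the probability that k fixed vertices receive pairwise distinct colors out of c available colors is c! / ((c − k)! · c^k), and the cost of one trial is crudely modeled by the factor 2^c for the number of color subsets. Both the cost model, which ignores all polynomial factors, and the function names are assumptions made for this illustration; they are not taken from the analysis in [82].

```python
import math


def colorful_probability(k, c):
    """Probability that k fixed vertices get pairwise distinct colors
    when each vertex is colored uniformly at random with one of c colors."""
    return math.perm(c, k) / c ** k


def modeled_cost(k, c, eps=0.01):
    """Number of trials for error probability eps times a 2^c table factor."""
    p = colorful_probability(k, c)
    trials = 1 if p >= 1.0 else math.ceil(math.log(eps) / math.log1p(-p))
    return trials * 2 ** c


for c in (8, 10, 12):
    print(c, modeled_cost(8, c))
```

For k = 8, the modeled cost with 10 colors (roughly 1.3k) is lower than with 8 colors, because the number of trials drops faster than the table grows.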

Protein interaction networks represent proteins by vertices and
mutual protein-protein interaction probabilities by weighted
edges. They are a valuable source of information for understanding
the functional organization of the proteome. Scott et al. [120]
demonstrated that high-scoring simple paths in the network
constitute plausible candidates for linear signal transduction pathways,
simple meaning that no vertex occurs more than once and
high-scoring meaning that the product of edge weights is maximized. To
match the above definition of minimum-weight path, one works
with the weight w(e) := −log p(e) of an edge e with interaction
probability p(e) between e’s endpoints. Then minimizing the sum
of the weights is equivalent to maximizing the product of the
probabilities.
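The equivalence follows from the fact that the logarithm turns products into sums. A small Python check, with made-up interaction probabilities for two hypothetical paths, illustrates this:

```python
import math


def path_weight(probabilities):
    """Sum of the edge weights w(e) = -log p(e) along a path."""
    return sum(-math.log(p) for p in probabilities)


# Interaction probabilities along two hypothetical paths.
path_a = [0.9, 0.8, 0.95]
path_b = [0.99, 0.5, 0.99]

for path in (path_a, path_b):
    print(math.prod(path), path_weight(path))
```

Path a has the larger product of probabilities and, accordingly, the smaller sum of weights; applying exp(−x) to a weight sum recovers the product.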

The currently most efficient implementation based on
color-coding [82] is capable of finding optimal paths of length up to 20 in
seconds within a yeast protein interaction network containing
about 4,500 vertices.

A particularly appealing aspect of color-coding is that it can be 
easily adapted to many practically relevant variations of the problem 
formulation: 

• The set of vertices where a path can start and end can be 
restricted (such as to force it to start in a membrane protein 
and end in a transcription factor [120]). 

• Not only the minimum-weight path can be computed but rather
a collection of low-weight paths (typically, one demands that
these paths must differ in a certain number of vertices to ensure
that they are diverse and not small modifications of the global
minimum-weight path) [82].

• More generally, pathway queries to a network, that is, the task of 
finding a pathway in a network that is as similar as possible to a 
query pathway, can be handled with color-coding [121]. 

Several other works use color-coding for querying in protein 
interaction networks. For example, the queries can be trees, allow¬ 
ing for identification of non-exact (homeomorphic) matches [52]. 
Another application is counting non-induced occurrences of sub¬ 
graph topologies in the form of trees and bounded treewidth 
subgraphs [6]. 
A further use of color-coding is to solve the graph motif
problem. In a biological application of graph motif, the query is a 
set of proteins, and the task is to find a matching set of proteins that 
are sequence-similar to the query proteins and span a connected 
region of the network. Bruckner et al. [33] and Betzler et al. [13] 
provided implementations based on color-coding; they differ in the 
way insertions and deletions are handled, and are thus not directly 
comparable. 

Further, color-coding has also found applications in string 
problems: for example, Bonizzoni et al. [32] used it to solve a 
variant of longest common subsequence that is motivated by 
a sequence comparison problem. However, to the best of our 
knowledge no string algorithm using color-coding has been imple¬ 
mented yet. 

6.3.1 Related Techniques

We mention some techniques that use ideas similar to color-coding.
To the best of our knowledge, with one exception none of them has
been implemented so far.

Two variants use only two colors to separate the pattern from 
surrounding vertices (random separation) [37] or to divide the 
graph into two parts for recursion (divide-and-color) [87]. Random
separation can be used to find small subgraphs with desired properties
in sparse graphs. For these problems enumerating connected
subgraphs and using color-coding [91] sometimes gives faster 
algorithms. A further extension known as chromatic coding was 
used to obtain (theoretically) fast algorithms for the dense triplet 
inconsistency problem motivated from phylogenetics [72]. 

Algebraic techniques [92, 93] can improve on the worst-case 
running time of many color-coding approaches; for example, the 
currently strongest worst-case bound for graph motif is obtained 
this way [16]. This approach, however, is not as flexible as color¬ 
coding, for example, with respect to the handling of large weights. 
Experiments for the unweighted version of minimum weight path 
on random graphs have shown that the approach is feasible for a 
path length of 16 and 8000 vertices [17]. 


7 Iterative Compression 

The main idea of iterative compression is induction: we construct a 
slightly smaller instance, solve it recursively, and then make use of 
the solution to solve the actual instance. While induction is a classic 
algorithmic approach, iterative compression first appeared in a work 
by Reed et al. in 2004 (see also a 2009 survey [75]). Although it is
perhaps not quite as generally applicable as data reduction or search 
trees, it appears to be useful for solving a wide range of problems 
and has led to significant breakthroughs in showing fixed-parameter
tractability results. Iterative compression is typically
used for “minimum obstruction deletion” problems: given a set of
items, omit the minimum number of items such that the remaining
items exhibit some “nice” structure. Thus, it can sometimes model
parsimonious error correction. Vertex cover is one example fitting
this scheme, and it can be solved with iterative compression [117].

7.1 Basic Concepts

The central concept of iterative compression is to employ a so-called
compression routine.

Definition 4. A compression routine is an algorithm that, given a 
problem instance and a solution of size k, either calculates a smaller 
solution or proves that the given solution is of minimum size. 

With a compression routine, we can find an optimal solution 
for an instance by recursively solving a smaller instance, using the 
solution for the smaller instance to find a possibly suboptimal 
solution for the actual instance, and then using the compression 
routine to find an optimal solution. For “minimum obstruction 
deletion” problems, the only nontrivial step is the compression 
routine. 
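
The scheme can be illustrated for vertex cover, which is mentioned above as one problem fitting it. The following is a minimal sketch, not the algorithm of [117]: the function names are hypothetical, the recursion is unrolled into a loop that adds one vertex at a time, and the compression routine simply tries all 2^|C| ways of deciding which vertices of the given cover C to keep. A vertex of C that is dropped forces all of its neighbors into the new cover.

```python
from itertools import combinations


def compress(edges, cover):
    """Given a vertex cover, return a strictly smaller one, or None."""
    cover = set(cover)
    target = len(cover) - 1
    for keep_size in range(len(cover) + 1):
        for keep in combinations(sorted(cover), keep_size):
            keep = set(keep)
            drop = cover - keep
            # An edge between two dropped vertices could not be covered.
            if any(u in drop and v in drop for u, v in edges):
                continue
            # Every neighbor of a dropped vertex must join the new cover.
            forced = set()
            for u, v in edges:
                if u in drop and v not in keep:
                    forced.add(v)
                if v in drop and u not in keep:
                    forced.add(u)
            candidate = keep | forced
            if len(candidate) <= target:
                return candidate
    return None


def minimum_vertex_cover(vertices, edges):
    """Iterative compression: grow the graph vertex by vertex."""
    present, cover = set(), set()
    for v in vertices:
        present.add(v)
        sub_edges = [(a, b) for a, b in edges if a in present and b in present]
        # Adding v yields a valid cover that is at most one too large.
        cover = cover | {v}
        smaller = compress(sub_edges, cover)
        if smaller is not None:
            cover = smaller
    return cover
```

A single call to the compression routine per added vertex suffices here, since the optimum of the larger graph cannot be smaller than that of the smaller graph. Vertex labels are assumed to be sortable so that the enumeration order is deterministic.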

The main strength of iterative compression is that it allows us to 
see a problem from a different angle, since the compression routine 
does not only have the problem instance as input, but also a 
solution, which carries valuable structural information on the 
input. Also, the compression routine does not need to find an 
optimal solution at once, but only any better solution. Therefore, 
the design of a compression routine can often be simpler than 
designing a complete algorithm. 

Algorithmically, the compression routine is the “complex” step 
in iterative compression in two regards: First, while the mode of use 
of the compression routine is usually straightforward, finding the 
compression routine itself often is not. Second, if the compression
routine is a fixed-parameter algorithm with respect to the parameter
k, then so is the whole algorithm.

7.2 Case Studies

The showcase for iterative compression is the vertex bipartization
problem, also known as odd cycle cover.

VERTEX BIPARTIZATION

Input: An undirected graph G = (V, E) and a nonnegative integer k.

Task: Find a set D ⊆ V of at most k vertices such that G[V \ D] is bipartite.

This problem appears as minimum fragment removal in the 
context of SNP haplotyping [112]. When analyzing DNA frag¬ 
ments obtained by shotgun sequencing, it is initially unknown 
which of the two chromosome copies of a diploid organism a 
fragment belongs to. We can, however, determine for some pairs 
of fragments that they cannot belong to the same chromosome 
copy since they contain conflicting information at some SNP locus. 
Using this information, it is straightforward to reconstruct the
chromosome assignment. We can model this as a graph problem,
where the fragments are the vertices and a conflict is represented as
an edge. The task is then to color the vertices with two colors such
that no vertices with the same color are adjacent. The problem gets
difficult in the presence of errors such as parasite DNA fragments
which randomly conflict with other fragments. In this scenario, we
ask for the least number of fragments to remove such that we can
get a consistent fragment assignment (see Fig. 10). Using the number
of fragments k to be removed as a parameter is a natural
approach, since the result is only meaningful for small k anyway.

Fig. 10 A vertex bipartization instance (left), and an optimal solution (right): when
deleting two fragments (dashed), the remaining fragments can be allocated to
the two chromosome copies (A and B) such that no conflicting fragments get the
same assignment

Iterative compression provided the first fixed-parameter algorithm
for vertex bipartization with this parameter [118]. We
sketch how to apply this to finding an optimal solution (a removal
set) for a vertex bipartization instance (G = (V, E), k). Choose an
arbitrary vertex v and let G′ be G with v deleted. Recursively find
an optimal removal set R′ for G′ (this recursion terminates after
n = |V| steps, where we can yield the empty removal set for the
empty graph). Clearly, R′ ∪ {v} is a removal set for G, although it
might not be optimal (it can be too large by one). Now using the
compression routine for G and R′ ∪ {v}, we can find an optimal
solution for G.

The compression routine itself works by examining a number
of vertex cuts in an auxiliary graph (that is, a set of vertices whose
deletion makes the graph disconnected), a task which can be
accomplished in polynomial time by maximum flow techniques.
We refer to the literature for details [80, 95, 118]. The running
time of the complete algorithm is $O(3^k \cdot mn)$ [80].

7.3 Applications and Implementations

The iterative compression algorithm for vertex bipartization has 
been employed for a number of biological applications. An imple¬ 
mentation, improved by heuristics, can solve all minimum fragment 
removal problems from a testbed based on human genome data 
within minutes, whereas established methods are only able to solve 
about half of the instances within reasonable time [80]. The unordered
maximum tree orientation, which models inference of signal
transmissions in protein-protein interaction networks based on
cause-effect pairs, can be reduced to graph bipartization [21].
Also, ordering and orienting contigs produced during genome
assembly can be reduced to graph bipartization, and this is implemented
in the SCARPA scaffolder [51]. Recently, an algorithm
with a better worst-case bound of $2.32^k \cdot n^{O(1)}$ based on linear
programming was presented [102], which seems like a promising
alternative to iterative compression for vertex bipartization.

Edge bipartization, the edge deletion version of vertex bipartization,
can also be solved by iterative compression [74]. Enhanced
with data reduction rules and generalized to the signed graph balancing
problem, this algorithm was used to analyze gene regulatory
networks [83]. It can solve many networks to optimality, but
fails for the largest ones [83]. The tanglegram layout problem is
about drawing two phylogenetic trees on the same species set in
order to facilitate analysis; it can be reduced to edge bipartization [24].
The implementation by Hüffner et al. [83] can find
exact solutions for all practically relevant tanglegram layout
instances within seconds [24]. Finally, computing the minimum
number of recombination events for general pedigrees with
two sites for all members can also be reduced to edge
bipartization [49].

Another prominent problem amenable to iterative compression 
is feedback vertex set, which also has applications for genetic 
linkage analysis [10]. While initial algorithms based on iterative 
compression [47, 74] had prohibitive worst-case running times,
the currently fastest known approach runs in $3.619^k \cdot n^{O(1)}$ time for
finding a feedback vertex set of k vertices [89]. However, these
algorithms have not been implemented yet.

The directed feedback vertex set problem was also shown to 
be fixed-parameter tractable by iterative compression [42], solving 
a long-standing open question. However, the worst-case running 
time bound is much worse than for the previously mentioned 
problems. Still, an experimental evaluation on random graphs [58],
employing also data reduction, showed encouraging results for very
small parameter values. Directed feedback vertex set has applications
in pairwise genome alignment under the duplication-loss
model [50] and in the comparison of gene orders [71]. For an
application in reconstructing reticulation networks in particular, 
the authors mention that the parameter could be expected to be 
very small [99]. 

Finally, the cluster vertex deletion problem, the “vertex deletion
variant” of cluster editing, aims to cluster objects by removing
objects that do not fit in the cluster structure. It can also be
solved by a fixed-parameter algorithm with respect to the number 
of removed vertices using iterative compression [84]. 
8 A Roadmap Towards Efficient Implementations

Here we try to give some general recommendations on how to go
about applying parameterized algorithmics to NP-hard computational
problems in practice.

8.1 Identification of Parameters

The first task is to identify fruitful parameters. As detailed in
Subheading 1.2, it is useful to consider several “structural” parameters,
possibly also deduced from a data-driven analysis of the input 
instances. The usefulness of the parameter clearly depends on 
whether it is small in the input instances. For graph instances, a 
tool such as Graphana (http://fpt.akt.tu-berlin.de/graphana/) 
that calculates a wide range of graph parameters can be helpful. At 
this point, it is also useful to determine whether the problem is 
fixed-parameter tractable or W[1]-hard. While a hardness result
encourages one to look for another parameter or combined parameters,
bear in mind that certain techniques such as data reduction can still 
be effective in practice even without a performance guarantee. 

8.2 Implementation of Brute-Force Search

The next thing to do is to implement a brute-force search that is as 
simple as possible. There are several reasons for this: First, it gives 
some first impression on what solutions look like (for example, can 
we use their size as parameter?). Second, a simple starting implementation
is invaluable in shaking out bugs from later, more
sophisticated implementations, in particular if results for random 
instances are systematically compared. Possibly the best way to get a 
simple brute-force result is to use an integer linear program (ILP). 
These sometimes need only a few lines when using a modeling 
language, but are often surprisingly effective. The second method 
of choice is a simple search tree (Subheading 3). 
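
As an illustration of how small such a reference implementation can be, the following is a minimal sketch of a brute-force search for vertex bipartization (Subheading 7.2). It uses plain subset enumeration rather than an ILP or a search tree, and the function names are hypothetical. It tries removal sets in order of increasing size and tests each remaining graph for bipartiteness by breadth-first 2-coloring.

```python
from collections import deque
from itertools import combinations


def is_bipartite(vertices, edges):
    """Breadth-first 2-coloring of the subgraph induced by `vertices`."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].append(v)
            adj[v].append(u)
    side = {}
    for start in vertices:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in side:
                    side[w] = 1 - side[u]
                    queue.append(w)
                elif side[w] == side[u]:
                    return False  # odd cycle found
    return True


def brute_force_bipartization(vertices, edges):
    """Return a minimum-size set of vertices whose removal leaves a bipartite graph."""
    vertices = list(vertices)
    for k in range(len(vertices) + 1):
        for removal in combinations(vertices, k):
            removed = set(removal)
            rest = [v for v in vertices if v not in removed]
            if is_bipartite(rest, edges):
                return removed
```

The running time grows roughly like n^k for a solution of size k, so this is only usable on small instances, but it provides ground truth against which later, more sophisticated implementations can be compared on random inputs.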

8.3 Implementation of Data Reduction

Data reduction is valuable in combination with any other algorithmic
technique such as approximation, heuristics, or fixed-parameter
algorithms. In some cases it can even completely solve instances 
without further effort; it can be considered as essential for the 
treatment of NP-hard problems. Thus it should always be the first 
nontrivial technique to be developed and implemented. When 
combined with even a naive brute-force approach, it can often 
already solve instances of notable size. For large instances, an 
efficient implementation of the data reduction rules is necessary. 
A rule of thumb is to aim for linear running time for most of the 
implemented data reduction rules and to apply linear-time data 
reduction rules first [126]. 

8.4 Tuned Search Trees

After this, the easiest speedups typically come from a more carefully
tuned search tree algorithm. Case distinction can help to improve
provable running time bounds, although it has often been reported
that a too complicated branching actually leads to a slowdown.
Heuristic branching priorities can help, as well as admissible heuristic
evaluation functions [57]. Further, interleaving with data reduction
can lead to a speedup [111].

8.5 Non-traditional Techniques

When search trees are not applicable or too slow, less clear instructions
can be given. The best thing to do is to look at other fixed-parameter
algorithms and techniques for inspiration: are we looking
for a small pattern in the input? Possibly color-coding
(Subheading 6) helps. Are we looking for minimum modifications to
obtain a nice combinatorial structure? Possibly iterative compression
(Subheading 7) is applicable. In this way, using some of the less
common approaches of fixed-parameter algorithms, one might still
come up with a fixed-parameter algorithm.

Here, one should be wary of exponential-space algorithms as
these can often fill the memory within seconds and therefore
become unusable in practice. In contrast, one should not be too
afraid of bad upper bounds for fixed-parameter algorithms: the
analysis is worst-case and often much too pessimistic.

8.6 Heuristic Speedups

Some of the largest speedups experienced in experiments come
from techniques that can be considered heuristic in the sense that
they do not improve worst-case time bounds or the kernel size. The
general idea of most heuristics is to recognize early that some
branches or subcases cannot lead to an optimal solution and to
skip those. Their potential effectiveness, even when no performance
guarantees can be given, should always be kept in mind
when implementing algorithms.

Furthermore, most algorithms will have numerous degrees of
freedom concerning their actual implementation, execution order,
and the value of some thresholds, for example, concerning the
fraction of search tree nodes to which data reduction should be
applied. There are tools for algorithm configuration that can exploit
this freedom and may yield orders of magnitude of speedup [79].

9 Conclusion

We surveyed several techniques for developing efficient fixed-parameter
algorithms for computationally hard (biological) problems.
Since many of these problems appear to “carry small parameters,”
we firmly believe that there will continue to be a strong
interaction between parameterized complexity analysis and algorithmic
bioinformatics. To make this as fruitful as possible, it is
necessary to analyze real-world data in search for “hidden structure”
which can be captured by suitable parameterizations. A
subsequent parameterized complexity analysis can then determine
which of these parameterizations yield fixed-parameter algorithms.
This data-driven line of algorithmic research is still underdeveloped
and should receive increased attention in future research. Moreover,
in order to obtain the practically most useful algorithms, it
may often be good to combine fixed-parameter algorithms (particularly,
data reduction and kernelization) with general-purpose tools
for solving computationally hard problems, including SAT solving 
and integer linear programming. This certainly will need a lot of 
experimentation going far beyond purely theoretical algorithm 
design. 


10 Notes 


1. To show that a problem is unlikely to be fixed-parameter tractable,
the concept of W[1]-hardness was developed. It is widely
assumed that a W[1]-hard problem cannot have a fixed-parameter
algorithm (W[t]-hardness, t ≥ 2, has the same implication).
For example, the clique problem to find a clique (complete
subgraph) in an undirected graph is W[1]-hard with
respect to the parameter “number of vertices in the clique.”
To show that a problem is W[1]-hard, a parameterized
reduction from a known W[1]-hard problem can be used (see,
e.g., [41, 54]).

2. There exist suitable data reduction rules when it is of interest to 
enumerate all minimal vertex covers of a given graph. For example,
Damaschke [46] suggests the notion of a full kernel that
contains all minimal solutions in a compressed form and thereby 
allows enumeration of them. 

3. One technique to show that a polynomial kernel is unlikely is 
called composition [53, 94]. A composition is an algorithm that 
combines the inputs of many instances of a problem into one 
“equivalent” instance. For 2-club, the composition is to take the 
disjoint union of the input graphs of the instances: Any solution 
to such a combined instance has to live completely inside one of 
its connected components, which are completely contained in 
one of the original input instances. Thus, the combined instance 
has a solution if and only if at least one of the input instances has 
one. The existence of a composition and a polynomial kerneliza¬ 
tion leads to an implausible complexity-theoretic collapse. Thus, 
it is widely assumed that there is no polynomial problem kernel 
for problems with a composition [53, 94]. 
Chapter 21


Information Visualization for Biological Data 

Tobias Czauderna and Falk Schreiber 


Abstract 

Visualization is a powerful method to present and explore a large amount of data. It is increasingly 
important in the life sciences and is used for analyzing different types of biological data, such as structural 
information, high-throughput data, and biochemical networks. This chapter gives a brief introduction to 
visualization methods for bioinformatics, presents two commonly used techniques in detail, and discusses a 
graphical standard for biological networks and cellular processes. 

Keywords Visualization, Data exploration, Heat-maps, Force-based layout, Graph drawing, Systems 
Biology Graphical Notation 


1 Introduction 


Visualization is the transformation of data, information or knowl¬ 
edge into a visual form such as images and maps. It uses the human 
ability to take in a large amount of data in a visual form and to 
detect trends and patterns in pictures easily. Visualization is a 
helpful method to analyze and explore data or to communicate 
information; a fact expressed by the common proverb “A picture 
speaks a thousand words.” The visual representation of data or 
knowledge is not a particularly modern technique, but rather is as 
old as human society. Rock engravings and cave images can be seen 
as an early form of visual communication between humans. Molecular
biological information has also been represented visually for a
long time. Well-known examples are illustrations in books, such as 
molecular structures (e.g., DNA and other molecules) or biological 
processes (e.g., cell cycle, metabolic pathways). Most of the other 
chapters in this book use visualizations to illustrate concepts or 
present information. 

Nowadays visualization is an increasingly important method in 
bioinformatics to present very diverse information. Structural 
information of molecules can be shown in 2D (structural formulae 
of substances) and 3D space [1-3]. Genome and sequence 
annotation is often displayed in linear or circular representations
with additional annotations [4-6]. Expression and metabolite profiles
are high-dimensional data which can be visualized with techniques
such as bar-charts, line-graphs, scatter-plot matrices [7],
parallel coordinates [8], heat-maps [9], and tree-maps [10].
There are several methods to visualize hierarchical structures
(e.g., phylogenetic trees) [11-13] and biochemical networks
(e.g., metabolic pathways) [14-16]. Typical examples of visualizations
in bioinformatics are shown in Fig. 1. Overviews of visualization
methods are given, for example, in the “Points of view” series
in Nature Methods and for omics data in [20].

This chapter presents heat-maps and force-based network layout
in detail and introduces the Systems Biology Graphical Notation, a 
graphical standard for biological networks and cellular processes. 

Heat-maps are a standard method to visualize and analyze 
large-scale data obtained by the high-throughput technologies dis¬ 
cussed in previous chapters. These technologies lead to an ever- 
increasing amount of molecular-biological data, deliver a snapshot 
of the system under investigation, and allow the comparison of a 
biological system under different conditions or in different devel¬ 
opmental stages. Examples include gene expression data [21], pro¬ 
tein data [22], and the quantification of metabolite concentrations 
[23]. A typical visualization of such data using a heat-map is shown 
in Fig. 2. 

Force-based network layout is the main method used to visualize
biological networks. Biological networks are important in 
bioinformatics; see also Section VI (Pathways and Networks). 
Biological processes form complex networks such as metabolic 
pathways, gene regulatory networks, and protein-protein interac¬ 
tion networks. Furthermore, the data obtained by high- 
throughput methods and biological networks are closely related. 
There are two common ways to interpret experimental data: (1) as a 
biological network and (2) in the context of an underlying 
biological network. A typical example of the first interpretation is 
the analysis of interactomics data, for example, data from two- 
hybrid experiments [25]. The result of these experiments is infor¬ 
mation as to whether proteins interact pairwise with each other or 
not. Taking many different protein pairs into account, a protein- 
protein interaction network can be directly derived. An example of 
the second interpretation is the analysis of metabolomics data, such 
as data from mass spectrometry based metabolome analysis [23]. 
These experiments give, for example, time series data for different 
metabolites, which can be mapped onto metabolic networks and 
then analyzed within the network context. A visualization of a 
network using force-based layout is shown in Fig. 3. There are 
many extensions to force-based layout algorithms such as 
extra forces [26, 27], animation [28], and the consideration of 
mapped data. 

Fig. 1 Examples of visualizations in bioinformatics (from top to bottom): 3D structure of a molecule (produced
with Molw PDB Viewer [17]), scatter-plot matrix of metabolite profiling data of different lines of an organism
(produced with VANTED [18]), layout of a metabolic pathway (produced with BioPath [19]), and line-graph of
time series data of the concentration of a metabolite (produced with VANTED [18])

Fig. 2 A heat-map showing the expression of genes under eight different
conditions (produced with R [24])

Fig. 3 A picture of a protein-protein interaction network based on the force-based layout method (produced
with CentiBin [29])

Fig. 4 A SBGN map showing the first steps of the metabolic pathway glycolysis with additional information.
The circles, rectangles, and rectangles with rounded corners represent simple chemicals (metabolites),
processes (reactions), and macromolecules (enzymes), respectively. Additional information given as diagrams
within simple chemicals and macromolecules represents metabolite measurements and activity of genes
related to enzymes (produced with SBGN-ED [40])

Systems Biology Graphical Notation (SBGN) [30] is a standard
for the graphical representation of biological networks and cellular
processes. It provides an unambiguous and uniform way to present
information, thereby reducing the risk of misinterpreting maps of
biological processes and supporting faster information exchange.
SBGN provides three corresponding views of a biological system
focusing on different aspects and levels of detail: Process Description
maps describe elements and processes of biological systems
[31], Entity Relationship maps focus on interactions between
biological entities [32], and Activity Flow maps describe information
flow between biological activities [33]. Several databases provide
information in SBGN, for example, Reactome [34],
PANTHER Pathways [35], BioModels Database including
Path2Models [36, 37], MetaCrop [38], and RIMAS [39]. A SBGN map
is shown in Fig. 4.


2 Methods 

2.1 Heat-maps

High-throughput data is often represented by a two-dimensional
matrix M. Usually the rows represent the measured entities (e.g.,
expression of genes) and the columns represent the different samples
(e.g., different time points, environmental conditions or genetically
modified lines of an organism). To show patterns in the data it
is often useful to rearrange the rows and/or columns of the matrix 
so that similar rows (columns) are close to each other, for example, 
to place genes with similar expression patterns close together. 

A heat-map is a two-dimensional, colored grid of the same size 
as the matrix M where the color of each place is determined by the 
corresponding value of the matrix as shown in Fig. 5. 

For a given matrix M, the algorithm to produce a heat-map is as 
follows: 
Fig. 5 (Left) A heat-map of the data set in the given order (x-axis: conditions, y-axis: genes). (Right)
Rearrangement of columns (conditions) and dendrogram showing a hierarchical clustering of the different 
conditions (conditions with similarly expressed genes are close together, produced with R [24]) 


1. (Optional) Rearrange the rows of the matrix as follows: Compute 
a distance matrix containing the distance between each pair of 
rows (consider each row as a vector). There are several possible 
distance measures (e.g., Euclidean distance, Manhattan distance 
and correlation coefficient). Based on the distance matrix, either 
rearrange the rows directly such that neighboring rows have only 
a small distance, or compute a hierarchical clustering (using one 
of the various methods available, such as complete linkage or 
single linkage). Rearrange the rows such that a crossing-free 
drawing of the tree representing the hierarchical clustering is 
obtained and similar rows are close together. Details of this 
rearranging step and several variations can be found in Chapter 
54 (Combinatorial optimization models for finding genetic sig¬ 
natures from gene expression datasets) and in [7, 41, 42]. 

2. (Also optional) Rearrange the columns of the matrix similarly. 

3. Use a color scheme such that the distances between the colors 
represent the distances between the values of the elements of the 
matrix M (see Note 1). Assign to each matrix element its color 
and compute a grid visualization and (optional) dendrogram(s) 
displaying the hierarchical clustering(s) for rows/columns as
shown in Fig. 5. 
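
The steps above can be sketched in a few lines. The following is a minimal illustration under simplifying assumptions, not the procedure used by R or any other package: it uses the Euclidean distance, rearranges the rows directly with a greedy nearest-neighbor ordering instead of a hierarchical clustering, and maps values linearly to gray levels from 0 to 255 in place of a full color scheme. All function names are hypothetical, and the matrix is assumed to be nonempty.

```python
import math


def distance_matrix(rows):
    """Pairwise Euclidean distances between the rows (step 1)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def greedy_order(dist, start=0):
    """Order rows so that each row is followed by its nearest unused row."""
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def to_gray(matrix):
    """Map matrix values linearly to gray levels 0..255 (step 3)."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant matrices
    return [[round(255 * (x - lo) / span) for x in row] for row in matrix]


# Hypothetical expression matrix: rows are genes, columns are conditions.
M = [[0, 0], [10, 10], [1, 1], [9, 9]]
order = greedy_order(distance_matrix(M))
heat_map = to_gray([M[i] for i in order])
```

Applying the same two functions to the transposed matrix rearranges the columns (step 2). A linear mapping keeps distances between gray levels proportional to distances between values, which is the property required of the color scheme in step 3.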

Free software to produce such visualizations is, for example, the
R programming package [24].
2.2 Force-Based Network Layout

Biological networks are commonly represented as graphs. A graph
G = (V, E) consists of a set of vertices $V = \{v_1, \ldots, v_n\}$
representing the biological objects (e.g., proteins) and a set of
edges $E \subseteq \{(v_i, v_j) \mid v_i, v_j \in V\}$ representing the interactions
between the biological objects (e.g., interactions between proteins).
To visualize a graph, a layout has to be computed, that is,
coordinates for the vertices and curves for the edges. In the following
we present the force-based graph layout approach usually
applied to biological networks.

A force-based layout method uses a physical analogy to draw 
graphs by simulating a system of physical forces defined on the 
graph. It produces a drawing, which represents a locally minimal 
energy configuration of the physical system. Such layout methods 
are popular as they are easy to understand and implement, and give 
good visualization results. In general, force-based layout methods 
consist of two parts: (1) a system of forces defined by the vertices 
and edges, and (2) a method to find positions for the vertices 
(representing the final layout of the graph) such that for each vertex 
the total force is zero [43]. There are several frequently used 
varieties of force-based methods [44-47].

Here we use a force model that interprets vertices as mutually 
repulsive “particles” and edges as “springs” connecting the parti¬ 
cles. This results in attractive forces between adjacent vertices and 
repulsive forces between nonadjacent vertices. To find a locally 
minimal energy configuration iterative numerical analysis is used. 
In the final drawing, the vertices are connected by straight lines. 

For a given graph G = (V, E) the algorithm to compute a layout
l(G) is as follows (see also Fig. 6):


1. Place all vertices on random positions. This gives an initial layout
$l_0(G)$ (see Note 2).

2. Repeat the following steps (steps 3 and 4) until a stop criterion 
(e.g., number of iterations or quality of current layout) is 
reached. 


3. For the current layout $l_i(G)$ compute for each vertex $v \in V$ the 
force 

$$F(v) = \sum_{(u,v) \in E} f_a(u,v) + \sum_{(u,v) \in V \times V} f_r(u,v),$$ 

which is the sum of all attractive forces $f_a$ and all repulsive forces 
$f_r$ affecting $v$. For 2D or 3D drawings these force vectors consist of 
two $(x, y)$ or three $(x, y, z)$ components, respectively. For example, 
for the $x$ component the forces $f_a$ and $f_r$ are defined as 

$$f_a(u,v) = c_1 \cdot (d(u,v) - l) \cdot \frac{x(u) - x(v)}{d(u,v)} \quad \text{and} \quad f_r(u,v) = \frac{c_2}{d(u,v)^2} \cdot \frac{x(v) - x(u)}{d(u,v)},$$ 

respectively, where $l$ is the optimal distance between any pair of 
adjacent vertices, $d(u,v)$ is the current distance between the vertices 
$u$ and $v$, $x(u)$ is the x-coordinate of vertex $u$, $x(v)$ is the 
x-coordinate of vertex $v$, and $c_1$, $c_2$ are positive constants 
(see Note 3). The other components are similarly defined. 
Fig. 6 Visualization of a graph at different steps of the force-based layout (from 
top left clockwise): initial layout, after 10, 25, and 100 iterations, respectively 

4. Move each vertex in the direction of F(v) to produce a new 
layout $l_{i+1}(G)$ (see Note 4). 
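
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the packages mentioned in this chapter: the function name `force_layout`, the default parameter values, the fixed iteration count as stop criterion, and the `max_step` cap on the displacement are all choices made for this sketch. Repulsion is applied to every pair of vertices, following the sum over V x V in step 3, and the spring term is added for adjacent pairs only.

```python
import math
import random


def force_layout(vertices, edges, l=1.0, c1=0.1, c2=0.1,
                 iterations=100, max_step=1.0, seed=0):
    """Return {vertex: (x, y)} after a fixed number of spring-embedder steps."""
    rng = random.Random(seed)
    # Step 1: random initial layout l_0(G).
    pos = {v: (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for v in vertices}
    adjacent = {frozenset(e) for e in edges}

    # Step 2: the stop criterion here is simply a fixed iteration count.
    for _ in range(iterations):
        new_pos = {}
        for v in vertices:
            fx = fy = 0.0
            # Step 3: sum the forces that every other vertex u exerts on v.
            for u in vertices:
                if u == v:
                    continue
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = max(math.hypot(dx, dy), 1e-9)  # avoid division by zero
                magnitude = c2 / d ** 2            # repulsion: pushes v away from u
                if frozenset((u, v)) in adjacent:
                    magnitude -= c1 * (d - l)      # spring: pulls v toward u if d > l
                fx += magnitude * dx / d
                fy += magnitude * dy / d
            # Step 4: move v along F(v). The cap on the step length is an
            # added safeguard for this sketch, not part of the steps above.
            length = math.hypot(fx, fy)
            scale = min(1.0, max_step / length) if length > 0.0 else 0.0
            new_pos[v] = (pos[v][0] + fx * scale, pos[v][1] + fy * scale)
        pos = new_pos  # all vertices move together, giving l_{i+1}(G)
    return pos
```

Calling `force_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])` returns a dictionary of 2D coordinates. All forces are computed from the current layout before any vertex is moved, so the result does not depend on the order in which vertices are visited.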

Free software packages to produce such network layouts are, 
for example, JUNG [48], Gravisto [49], and Vanted [50]. 

2.3 SBGN 

Depending on the type of biological information and the level of 
detail, different SBGN languages are recommended (see Note 5). 
Process Description (PD) maps are suitable to represent the transitions 
of entities from one form or state to another with a high level 
of detail. Such maps are unambiguous, mechanistic, and sequential. 
The representation of multistate entities results in a combinatorial 
explosion leading to large and complex maps. A typical example for 
a PD map is a metabolic pathway as shown in Fig. 4. Entity 
Relationship (ER) maps show the relations between entities and 
the influence of entities upon the behavior of other entities. Such 
maps are unambiguous, mechanistic, and non-sequential. A typical 
example for an ER map is a protein interaction network. Activity 
Flow (AF) maps represent the activity flow from one entity to 
another or within the same entity and provide an abstract view on 
a biological system where detailed mechanistic information is either 
not known or omitted. Such maps are ambiguous, conceptual, and 
sequential. A typical example for an AF map is a signaling pathway. 

SBGN defines a number of glyphs for the different entities and 
how these glyphs can be combined into valid SBGN maps, but it does 
not outline how to embody biological knowledge. Therefore, 
SBGN bricks [51] have been introduced as a means for the 
representation of biological knowledge in SBGN. SBGN bricks are 
building blocks representing recurring biological patterns in all 
three SBGN languages, which can be used for quick assembly 
of SBGN maps. They enable users to draw SBGN maps 
directly without the need to know all details from the SBGN 
specifications. An initial set of SBGN bricks in a wiki style format 
is available at http://sbgnbricks.sourceforge.net covering a number 
of important biological processes, which can be extended on 
demand. 

Figure 7 shows the assembly of the SBGN map from Fig. 4 
using SBGN bricks (without the additional information, see 
Note 6). The necessary steps to assemble the SBGN map are as 
follows: 

1. Choose the SBGN bricks “Catalysis - Irreversible reaction with 
2 substrates and 2 products” and “Catalysis - Irreversible reaction 
with 1 substrate and 1 product” and place them on the 
drawing area, using the first brick twice (left and right) and the 
second brick once (middle). 

2. Merge “P1” from the brick on the left with “S1” from the brick 
in the middle and merge “P1” from the brick in the middle with 
“S1” from the brick on the right. 

3. Change the labels of the three macromolecules to “hexokinase”, 
“glucose-6P isomerase”, and “phosphofructokinase” (from left 
to right). 

4. Change the labels of the simple chemicals “S1” and “P1” to 
“glucose”, “glucose 6P”, “fructose 6P”, and “fructose 1,6P” 
(from left to right). 

5. Change the label of the simple chemicals “S2” to “ATP” and 
add clone markers to indicate they appear more than once on the 
map. Change the label of the simple chemicals “P2” to “ADP” 
and add clone markers to indicate they appear more than once 
on the map. 

6. Adapt the layout of the map. 
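
To make the merging and relabeling in steps 1 to 5 concrete, the following Python sketch models a brick as a small data structure. It is purely illustrative: the classes `Glyph` and `Brick` and the helper functions are hypothetical names invented here, and they do not correspond to the API of SBGN-ED or to any SBGN file format. Merging is modeled by letting two bricks share the same glyph object. Step 6 (layout) is not covered.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Glyph:
    """Hypothetical stand-in for one labeled glyph on the map."""
    label: str
    clone: bool = False  # True if a clone marker has been added


@dataclass(eq=False)
class Brick:
    """Hypothetical stand-in for one catalysis brick."""
    enzyme: Glyph
    substrates: list
    products: list


def catalysis_brick(n_substrates, n_products):
    # Placeholder labels S1, S2, ... and P1, P2, ... as used in the steps above.
    return Brick(Glyph("macromolecule"),
                 [Glyph(f"S{i}") for i in range(1, n_substrates + 1)],
                 [Glyph(f"P{i}") for i in range(1, n_products + 1)])


def merge(left, right):
    # Step 2: P1 of the left brick and S1 of the right brick become one glyph.
    right.substrates[0] = left.products[0]


def assemble_map():
    # Step 1: place three bricks (2x2, 1x1, 2x2) from left to right.
    bricks = [catalysis_brick(2, 2), catalysis_brick(1, 1), catalysis_brick(2, 2)]
    # Step 2: merge neighboring bricks.
    merge(bricks[0], bricks[1])
    merge(bricks[1], bricks[2])
    # Step 3: label the macromolecules.
    enzymes = ["hexokinase", "glucose-6P isomerase", "phosphofructokinase"]
    for brick, name in zip(bricks, enzymes):
        brick.enzyme.label = name
    # Step 4: label the main chain (S1 of the first brick, then every P1).
    chain = [bricks[0].substrates[0]] + [b.products[0] for b in bricks]
    names = ["glucose", "glucose 6P", "fructose 6P", "fructose 1,6P"]
    for glyph, name in zip(chain, names):
        glyph.label = name
    # Step 5: label S2 and P2 and mark them as clones.
    for brick in bricks:
        for glyph in brick.substrates[1:]:
            glyph.label, glyph.clone = "ATP", True
        for glyph in brick.products[1:]:
            glyph.label, glyph.clone = "ADP", True
    return bricks
```

Because merged slots refer to one shared object, relabeling “P1” of the left brick in step 4 also relabels “S1” of the middle brick, which mirrors what happens when the two glyphs are merged on the drawing area.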

Freely available software to produce visualizations in SBGN 
from scratch or by using the SBGN bricks is, for example, SBGN-ED 
[40] (see Note 7). A detailed step-by-step description for the 
creation of visualizations in SBGN enriched by experimental data 
similar to the SBGN map shown in Fig. 4 can be found in [52]. 


3 Notes 


1. Do not use a red-green color scheme, as quite a number of 
people are red-green colorblind and therefore unable to interpret 
the visualization. 

2. The initial positions of the vertices should not be on a line. 
An alternative to a random placement is to use a given initial 
layout, which then can be improved by the force-based layout 
algorithm. 

Fig. 7 Assembly of an SBGN map using SBGN bricks (see also Fig. 4). From top to bottom: SBGN bricks 
“Catalysis - Irreversible reaction with 2 substrates and 2 products” and “Catalysis - Irreversible reaction with 
1 substrate and 1 product” placed on drawing area; “P1” from the brick on the left merged with “S1” from the 
brick in the middle and “P1” from the brick in the middle merged with “S1” from the brick on the right; labels 
of macromolecules and simple chemicals changed, clone markers added; layout of the map adapted 

3. The parameters $l$, $c_1$, and $c_2$ greatly affect the final drawing. A 
good way to find appropriate values for a specific graph is to 
interactively change them during the run of the algorithm until 
appropriate interim results are obtained. 

4. It is possible to dampen the force F(v) with an increasing number 
of iterations to allow large movements of vertices in the 
beginning and only small movements close to the end of the 
algorithm. This can help to avoid “oscillation effects” where 
vertices repeatedly jump between two positions. 

5. The SBGN specifications [31-33] provide detailed descriptions 
of any element of SBGN as well as layout rules. However, to start 
with SBGN and represent simple information such as metabolic 
or regulatory pathways, very few symbols are necessary. 

6. In SBGN maps, color does not have any meaning, and therefore, 
color can be used to express user-specific information. 

7. SBGN maps can be produced on paper or with tools. Tools such 
as SBGN-ED [40] support the creation of SBGN maps through 
validation mechanisms. 
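
The damping described in Note 4 can be sketched as follows. This is one possible scheme under stated assumptions, not a prescribed one: the function name `damped_step`, the linear decrease, and the parameter `initial_max_step` are hypothetical choices for illustration. The idea is to cap the length of each displacement by a limit that shrinks as the iteration count grows, so vertices can move far at the start and only slightly near the end.

```python
import math


def damped_step(force, iteration, total_iterations, initial_max_step=1.0):
    """Return the 2D displacement for a force vector, capped by a shrinking limit."""
    # The limit decreases linearly from initial_max_step to zero.
    limit = initial_max_step * (1.0 - iteration / total_iterations)
    length = math.hypot(force[0], force[1])
    if length == 0.0 or limit <= 0.0:
        return (0.0, 0.0)
    # Keep the direction of the force, but never move farther than the limit.
    scale = min(length, limit) / length
    return (force[0] * scale, force[1] * scale)
```

In a layout loop, the displacement returned by `damped_step(F_v, i, n)` would replace the raw force F(v) when moving vertex v in iteration i of n. Other schedules, for example an exponential decrease, fit the same pattern.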